Re: [HACKERS] segment size depending *_wal_size defaults (was increasing the default WAL segment size)
On 2017-08-30 12:52:26 +0900, Michael Paquier wrote:
> On Wed, Aug 30, 2017 at 11:20 AM, Peter Eisentraut wrote:
> > On 8/29/17 20:36, Andres Freund wrote:
> >> So the question is whether we want {max,min}_wal_size be sized in
> >> multiples of segment sizes or as a proper byte size. I'm leaning
> >> towards the latter.
> >
> > I'm not sure what the question is or what its impact would be.
>
> FWIW, I get the question as: do we want the in-memory values of
> min/max_wal_size to be calculated in MB (which is now the case) or
> just bytes. Andres tends for using bytes.

Not quite. There's essentially two things:

1) Currently the default for {min,max}_wal_size depends on the segment size. Given that the segment size is about to be configurable, that seems confusing.

2) Currently wal_segment_size is measured in GUC_UNIT_XBLOCKS, which requires us to keep two copies of the underlying variable, one in XBLOCKS, one in bytes. I'd rather just have the byte variant.

Regards,

Andres

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] A design for amcheck heapam verification
On Tue, Aug 29, 2017 at 7:22 PM, Thomas Munro wrote:
> Indeed. Thank you for working on this! To summarise a couple of
> ideas that Peter and I discussed off-list a while back: (1) While
> building the hash table for a hash join we could build a Bloom filter
> per future batch and keep it in memory, and then while reading from
> the outer plan we could skip writing tuples out to future batches if
> there is no chance they'll find a match when read back in later (works
> only for inner joins and only pays off in inverse proportion to the
> join's selectivity);

Right. Hash joins do tend to be very selective, though, so I think that this could help rather a lot. With just 8 or 10 bits per element, you can eliminate almost all the batch write-outs on the outer side. No per-worker synchronization for BufFiles is needed when this happens, either. It seems like that could be very important.

> To use this for anything involving parallelism where a Bloom filter
> must be shared we'd probably finish up having to create a shared
> version of bloom_init() that either uses caller-provided memory and
> avoids the internal pointer, or allocates DSA memory. I suppose you
> could consider splitting your bloom_init() function up into
> bloom_estimate() and bloom_init(user_supplied_space, ...) now, and
> getting rid of the separate pointer to bitset (ie stick char
> bitset[FLEXIBLE_ARRAY_MEMBER] at the end of the struct)?

Makes sense. Not hard to add that.

> I was just observing that there is an opportunity for code reuse here.
> This function's definition would ideally be something like:
>
> double
> bloom_prop_bits_set(bloom_filter *filter)
> {
>     return popcount(filter->bitset, NBYTES(filter)) / (double) NBITS(filter);
> }
>
> This isn't an objection to the way you have it, since we don't have a
> popcount function yet! We have several routines in the tree for
> counting bits, though not yet the complete set from Hacker's Delight.

Right.
I'm also reminded of the lookup tables for the visibility/freeze map.

> Your patch brings us one step closer to that goal. (The book says
> that this approach is good for sparse bitsets, but your comment says
> that we expect something near 50%. That's irrelevant anyway since a
> future centralised popcount() implementation would do this in
> word-sized chunks with a hardware instruction or branch-free-per-word
> lookups in a table and not care at all about sparseness.)

I own a copy of Hacker's Delight (well, uh, Daniel Farina lent me his copy about 2 years ago!). pop()/popcount() does seem like a clever algorithm that we should probably think about adopting in some cases, but I should point out that the current caller of my bloom_prop_bits_set() function is an elog() DEBUG1 call. This is not at all performance critical.

> + * Test if bloom filter definitely lacks element.
>
> I think where "Bloom filter" appears in prose it should have a capital
> letter (person's name).

Agreed.

--
Peter Geoghegan
Re: [HACKERS] segment size depending *_wal_size defaults (was increasing the default WAL segment size)
On Wed, Aug 30, 2017 at 11:20 AM, Peter Eisentraut wrote:
> On 8/29/17 20:36, Andres Freund wrote:
>> So the question is whether we want {max,min}_wal_size be sized in
>> multiples of segment sizes or as a proper byte size. I'm leaning
>> towards the latter.
>
> I'm not sure what the question is or what its impact would be.

FWIW, I get the question as: do we want the in-memory values of min/max_wal_size to be calculated in MB (which is now the case) or just bytes. Andres tends for using bytes.

--
Michael
Re: [HACKERS] expanding inheritance in partition bound order
On Tue, Aug 29, 2017 at 10:36 PM, Amit Langote wrote:
>> I keep having the feeling that this is a big patch with a small patch
>> struggling to get out. Is it really necessary to change
>> RelationGetPartitionDispatchInfo so much or could you just do a really
>> minimal surgery to remove the code that sets the stuff we don't need?
>> Like this:
>
> Sure, done in the attached updated patch.

On first glance, that looks pretty good. I'll have a deeper look tomorrow.

It strikes me that if PartitionTupleRoutingInfo is an executor structure, we should probably (as a separate patch) relocate FormPartitionKeyDatum and get_partition_for_tuple to someplace in src/backend/executor and rename them accordingly; maybe it's time for an execPartition.c? catalog/partition.c is getting really big, so it's not a bad thing if some of that stuff gets moved elsewhere.

A lingering question in my mind, though, is whether it's really correct to think of PartitionTupleRoutingInfo as executor-specific. Maybe we're going to need that for other things too?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] expanding inheritance in partition bound order
On 2017/08/29 4:26, Robert Haas wrote:
> I think this patch could be further simplified by continuing to use
> the existing function signature for RelationGetPartitionDispatchInfo
> instead of changing it to return a List * rather than an array. I
> don't see any benefit to such a change. The current system is more
> efficient.

OK, restored the array way.

> I keep having the feeling that this is a big patch with a small patch
> struggling to get out. Is it really necessary to change
> RelationGetPartitionDispatchInfo so much or could you just do a really
> minimal surgery to remove the code that sets the stuff we don't need?
> Like this:

Sure, done in the attached updated patch.

Thanks,
Amit

From 9dd8e6f6bd3636f8c125c71e6d1c65bf606a2a22 Mon Sep 17 00:00:00 2001
From: amit
Date: Wed, 30 Aug 2017 10:02:05 +0900
Subject: [PATCH 1/2] Decouple RelationGetPartitionDispatchInfo() from executor

Currently it and the structure it generates viz. PartitionDispatch objects are too coupled with the executor's tuple-routing code. In particular, it's pretty undesirable that it makes it the responsibility of the caller to release some resources, such as executor tuple table slots, tuple-conversion maps, etc. That makes it harder to use in places other than where it's currently being used. After this refactoring, ExecSetupPartitionTupleRouting() now needs to do some of the work that was previously done in RelationGetPartitionDispatchInfo().
---
 src/backend/catalog/partition.c        | 53 +++-
 src/backend/commands/copy.c            | 32 +++--
 src/backend/executor/execMain.c        | 88 ++
 src/backend/executor/nodeModifyTable.c | 37 +++---
 src/include/catalog/partition.h        | 20 +++-
 src/include/executor/executor.h        |  4 +-
 src/include/nodes/execnodes.h          | 40 +++-
 7 files changed, 181 insertions(+), 93 deletions(-)

diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 96a64ce6b2..c92756ecd5 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -1081,7 +1081,6 @@ RelationGetPartitionDispatchInfo(Relation rel,
 		Relation	partrel = lfirst(lc1);
 		Relation	parent = lfirst(lc2);
 		PartitionKey partkey = RelationGetPartitionKey(partrel);
-		TupleDesc	tupdesc = RelationGetDescr(partrel);
 		PartitionDesc partdesc = RelationGetPartitionDesc(partrel);
 		int			j,
 					m;
@@ -1089,29 +1088,12 @@ RelationGetPartitionDispatchInfo(Relation rel,
 		pd[i] = (PartitionDispatch) palloc(sizeof(PartitionDispatchData));
 		pd[i]->reldesc = partrel;
 		pd[i]->key = partkey;
-		pd[i]->keystate = NIL;
 		pd[i]->partdesc = partdesc;
 		if (parent != NULL)
-		{
-			/*
-			 * For every partitioned table other than root, we must store a
-			 * tuple table slot initialized with its tuple descriptor and a
-			 * tuple conversion map to convert a tuple from its parent's
-			 * rowtype to its own. That is to make sure that we are looking at
-			 * the correct row using the correct tuple descriptor when
-			 * computing its partition key for tuple routing.
-			 */
-			pd[i]->tupslot = MakeSingleTupleTableSlot(tupdesc);
-			pd[i]->tupmap = convert_tuples_by_name(RelationGetDescr(parent),
-												   tupdesc,
-								  gettext_noop("could not convert row type"));
-		}
+			pd[i]->parentoid = RelationGetRelid(parent);
 		else
-		{
-			/* Not required for the root partitioned table */
-			pd[i]->tupslot = NULL;
-			pd[i]->tupmap = NULL;
-		}
+			pd[i]->parentoid = InvalidOid;
+		pd[i]->indexes = (int *) palloc(partdesc->nparts * sizeof(int));

 		/*
@@ -1860,7 +1842,7 @@ generate_partition_qual(Relation rel)
  * Construct values[] and isnull[] arrays for the partition key
  * of a tuple.
  *
- * pd			Partition dispatch object of the partitioned table
+ * ptrinfo		PartitionTupleRoutingInfo object of the table
  * slot			Heap tuple from which to extract partition key
  * estate		executor state for evaluating any partition key
  *
Re: [HACKERS] pgbench: Skipping the creating primary keys after initialization
On Tue, Aug 29, 2017 at 4:47 PM, Fabien COELHO wrote:
>
> Hello,
>
> Patch applies and works.
>
>>> I would like to insist a little bit: the name was declared as an int but
>>> passed to init and used as a bool there before the patch. Conceptually
>>> what is meant is really a bool, and I see no added value bar saving a few
>>> strokes to have an int. ISTM that recent pgbench changes have started
>>> turning old int-for-bool habits into using bool when bool is meant.
>>
>> Since is_no_vacuum is an existing one, if we follow the habit we should
>> change other similar variables as well: is_init_mode, do_vacuum_accounts
>> and debug. And I think we should change them in a separate patch.
>
> Hmmm. The existing "is_no_vacuum" variable is typed *both* as int (in
> "main") and as bool (in "init"), called by main (yuk!). I see no reason to
> choose the bad one for the global :-)

Yeah, I think this might be a good timing to re-consider int-for-bool habits in pgbench. If we decided to change is_no_vacuum to bool I want to change other similar variables as well.

>> After more thought, I'm a bit inclined to not have a short option for
>> --custom-initialize because this option will be used for not primary
>> cases. It would be better to save short options for future
>> enhancements of pgbench. Thought?
>
> I like it as is, especially as now the associated value is a simple and
> short string, I think that it makes sense to have a simple and short option
> to trigger it. Moreover -I stands cleanly for "initialization", and the
> capital stands for something a little special which it is. It's too good to
> miss.
>
> I think that the "-I" should be added to the "--help" line, as it is done
> with other short & long options.

Okay, I'll leave it as of now. Maybe we can discuss later.
> Repeating "-I f" results in multiple foreign key constraints:
>
> Foreign-key constraints:
>     "pgbench_tellers_bid_fkey" FOREIGN KEY (bid) REFERENCES pgbench_branches(bid)
>     "pgbench_tellers_bid_fkey1" FOREIGN KEY (bid) REFERENCES pgbench_branches(bid)
>     "pgbench_tellers_bid_fkey2" FOREIGN KEY (bid) REFERENCES pgbench_branches(bid)
>     "pgbench_tellers_bid_fkey3" FOREIGN KEY (bid) REFERENCES pgbench_branches(bid)
>
> I wonder if this could be avoided easily? Maybe by setting the constraint
> name explicitly so that the second one fails on the existing one, which is
> fine, like for primary keys? Or adding a DROP CONSTRAINT IF EXISTS before
> the CREATE CONSTRAINT, like for tables? Or doing nothing about it? I would
> prefer the first option.

Good point, I agree with the first option.

> Maybe the initial cleanup (DROP TABLE) could be made an option added to the
> default, so that cleaning up the database could be achieved with some
> "pgbench -i -I c", instead of connecting and dropping the tables one by one
> which I have done quite a few times... What do you think?

Yeah, I sometimes wanted that. Having the cleaning up tables option would be a good idea.

> Before it is definitely engraved, I'm thinking about the letters:
>
> c - cleanup
> t - create table
> d - data
> p - primary key
> f - foreign key
> v - vacuum
>
> I think it is mostly okay, but it is the last time to think about it. Using
> "d" for cleanup (drop) would mean finding another letter for filling in
> data... maybe "g" for data generation? "c" may have been chosen for "create
> table", but then would not be available for "cleanup". Thoughts?

I'd say "g" for data generation would be better. Also, I'm inclined to add a command for the unlogged tables. How about this?
c - cleanup
t - create table
u - create unlogged table
g - data generation
p - primary key
f - foreign key
v - vacuum

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Re: [HACKERS] A design for amcheck heapam verification
On Wed, Aug 30, 2017 at 1:00 PM, Peter Geoghegan wrote:
> On Tue, Aug 29, 2017 at 4:34 PM, Thomas Munro wrote:
>> Some drive-by comments on the lib patch:
>
> I was hoping that you'd look at this, since you'll probably want to
> use a bloom filter for parallel hash join at some point. I've tried to
> keep this one as simple as possible. I think that there is a good
> chance that it will be usable for parallel hash join with multiple
> batches. You'll need to update the interface a little bit to make that
> work (e.g., bring-your-own-hash interface), but hopefully the changes
> you'll need will be limited to a few small details.

Indeed. Thank you for working on this! To summarise a couple of ideas that Peter and I discussed off-list a while back: (1) While building the hash table for a hash join we could build a Bloom filter per future batch and keep it in memory, and then while reading from the outer plan we could skip writing tuples out to future batches if there is no chance they'll find a match when read back in later (works only for inner joins and only pays off in inverse proportion to the join's selectivity); (2) We could push a Bloom filter down to scans (many other databases do this, and at least one person has tried this with PostgreSQL and found it to pay off[1]).

To use this for anything involving parallelism where a Bloom filter must be shared we'd probably finish up having to create a shared version of bloom_init() that either uses caller-provided memory and avoids the internal pointer, or allocates DSA memory. I suppose you could consider splitting your bloom_init() function up into bloom_estimate() and bloom_init(user_supplied_space, ...) now, and getting rid of the separate pointer to bitset (ie stick char bitset[FLEXIBLE_ARRAY_MEMBER] at the end of the struct)?

>> +bloom_add_element(bloom_filter *filter, unsigned char *elem, Size len)
>>
>> I think the plan is to use size_t for new stuff[1].
>
> I'd forgotten.
>> This is another my_log2(), right?
>>
>> It'd be nice to replace both with fls() or flsl(), though it's
>> annoying to have to think about long vs int64 etc. We already use
>> fls() in two places and supply an implementation in src/port/fls.c for
>> platforms that lack it (Windows?), but not the long version.
>
> win64 longs are only 32-bits, so my_log2() would do the wrong thing
> for me on that platform. pow2_truncate() is provided with a number of
> bits as its argument, not a number of bytes (otherwise this would
> work).

Hmm. Right.

> Ideally, we'd never use long integers, because its width is platform
> dependent, and yet it is only ever used as an alternative to int
> because it is wider than int. One example of where this causes
> trouble: logtape.c uses long ints, so external sorts can use half the
> temp space on win64.

Agreed, "long" is terrible.

>> +bloom_prop_bits_set(bloom_filter *filter)
>> +{
>> +    int        bitset_bytes = NBITS(filter) / BITS_PER_BYTE;
>> +    int64      bits_set = 0;
>> +    int        i;
>> +
>> +    for (i = 0; i < bitset_bytes; i++)
>> +    {
>> +        unsigned char byte = filter->bitset[i];
>> +
>> +        while (byte)
>> +        {
>> +            bits_set++;
>> +            byte &= (byte - 1);
>> +        }
>> +    }
>
> I don't follow what you mean here.

I was just observing that there is an opportunity for code reuse here. This function's definition would ideally be something like:

double
bloom_prop_bits_set(bloom_filter *filter)
{
    return popcount(filter->bitset, NBYTES(filter)) / (double) NBITS(filter);
}

This isn't an objection to the way you have it, since we don't have a popcount function yet! We have several routines in the tree for counting bits, though not yet the complete set from Hacker's Delight. Your patch brings us one step closer to that goal. (The book says that this approach is good for sparse bitsets, but your comment says that we expect something near 50%.
That's irrelevant anyway since a future centralised popcount() implementation would do this in word-sized chunks with a hardware instruction or branch-free-per-word lookups in a table and not care at all about sparseness.)

+ * Test if bloom filter definitely lacks element.

I think where "Bloom filter" appears in prose it should have a capital letter (person's name).

[1] http://www.nus.edu.sg/nurop/2010/Proceedings/SoC/NUROP_Congress_Cheng%20Bin.pdf

--
Thomas Munro
http://www.enterprisedb.com
Re: [HACKERS] segment size depending *_wal_size defaults (was increasing the default WAL segment size)
On 8/29/17 20:36, Andres Freund wrote:
> So the question is whether we want {max,min}_wal_size be sized in
> multiples of segment sizes or as a proper byte size. I'm leaning
> towards the latter.

I'm not sure what the question is or what its impact would be.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Update low-level backup documentation to match actual behavior
On Tue, Aug 29, 2017 at 10:59 PM, David Steele wrote:
> Hi Robert,
>
> On 8/25/17 4:03 PM, David Steele wrote:
>> On 8/25/17 3:26 PM, Robert Haas wrote:
>>> On Fri, Aug 25, 2017 at 3:21 PM, David Steele wrote:
>>>> No problem. I'll base it on your commit to capture any changes you made.
>>>
>>> Thanks, but you incorporated everything I wanted in response to my
>>> first review -- so I didn't tweak it any further.
>>
>> Thank you for committing that. I'll get the 9.6 patch to you early next
>> week.
>
> Attached is the 9.6 patch. It required a bit more work in func.sgml
> than I was expecting so have a close look at that. The rest was mostly
> removing irrelevant hunks.

+     switch to the next WAL segment. On a standby, it is not possible to
+     automatically switch WAL segments, so you may wish to run
+     pg_switch_wal on the primary to perform a manual
+     switch. The reason for the switch is to arrange for
[...]
+    WAL segments have been archived. If write activity on the primary is low, it
+    may be useful to run pg_switch_wal on the primary in order to
+    trigger an immediate segment switch of the last required WAL

It seems to me that both portions are wrong. There is no archiving wait on standbys for 9.6, and pg_stop_backup triggers the segment switch by itself, so recommending pg_switch_wal on the primary is moot. Also, pg_switch_xlog was only renamed to pg_switch_wal in PG10, so for 9.6 the former name applies.

--
Michael
Re: [HACKERS] segment size depending *_wal_size defaults (was increasing the default WAL segment size)
On 2017-08-30 10:14:22 +0900, Michael Paquier wrote:
> On Wed, Aug 30, 2017 at 10:06 AM, Andres Freund wrote:
> > On 2017-08-30 09:49:14 +0900, Michael Paquier wrote:
> >> Do you think that we should worry about wal segment sizes higher than
> >> 2GB? Support for int64 GUCs is not here yet.
> >
> > 1GB will be the limit anyway.
>
> Yeah, but imagine that we'd want to raise that even more up.

I'm doubtful of that. But even if we did, it'd not be hard to add GUC support.

- Andres
Re: [HACKERS] segment size depending *_wal_size defaults (was increasing the default WAL segment size)
On Wed, Aug 30, 2017 at 10:06 AM, Andres Freund wrote:
> On 2017-08-30 09:49:14 +0900, Michael Paquier wrote:
>> Do you think that we should worry about wal segment sizes higher than
>> 2GB? Support for int64 GUCs is not here yet.
>
> 1GB will be the limit anyway.

Yeah, but imagine that we'd want to raise that even more up. By using bytes, int64 GUCs become mandatory, or we'd come back into using MBs for this parameter, bringing us back to the original implementation. That's something worth thinking about.

--
Michael
Re: [HACKERS] segment size depending *_wal_size defaults (was increasing the default WAL segment size)
On 2017-08-30 09:49:14 +0900, Michael Paquier wrote:
> On Wed, Aug 30, 2017 at 9:36 AM, Andres Freund wrote:
> > So the question is whether we want {max,min}_wal_size be sized in
> > multiples of segment sizes or as a proper byte size. I'm leaning
> > towards the latter.
>
> Logically in the code it is just a matter of adjusting multipliers.

No code difficulties here, I think we just need to decide what we want.

> Do you think that we should worry about wal segment sizes higher than
> 2GB? Support for int64 GUCs is not here yet.

1GB will be the limit anyway.

Greetings,

Andres Freund
Re: [HACKERS] A design for amcheck heapam verification
On Tue, Aug 29, 2017 at 4:34 PM, Thomas Munro wrote:
> Some drive-by comments on the lib patch:

I was hoping that you'd look at this, since you'll probably want to use a bloom filter for parallel hash join at some point. I've tried to keep this one as simple as possible. I think that there is a good chance that it will be usable for parallel hash join with multiple batches. You'll need to update the interface a little bit to make that work (e.g., bring-your-own-hash interface), but hopefully the changes you'll need will be limited to a few small details.

> +bloom_add_element(bloom_filter *filter, unsigned char *elem, Size len)
>
> I think the plan is to use size_t for new stuff[1].

I'd forgotten.

> This is another my_log2(), right?
>
> It'd be nice to replace both with fls() or flsl(), though it's
> annoying to have to think about long vs int64 etc. We already use
> fls() in two places and supply an implementation in src/port/fls.c for
> platforms that lack it (Windows?), but not the long version.

win64 longs are only 32-bits, so my_log2() would do the wrong thing for me on that platform. pow2_truncate() is provided with a number of bits as its argument, not a number of bytes (otherwise this would work).

Ideally, we'd never use long integers, because its width is platform dependent, and yet it is only ever used as an alternative to int because it is wider than int. One example of where this causes trouble: logtape.c uses long ints, so external sorts can use half the temp space on win64.

> +/*
> + * Hash function is taken from sdbm, a public-domain reimplementation of the
> + * ndbm database library.
> + */
> +static uint32
> +sdbmhash(unsigned char *elem, Size len)
> +{
>
> I see that this is used in gawk, BerkeleyDB and all over the place[2].
> Nice. I understand that the point of this is to be a hash function
> that is different from our usual one, for use by k_hashes().

Right. Its only job is to be a credible hash function that isn't derivative of hash_any().
> Do you
> think it belongs somewhere more common than this? It seems a bit like
> our hash-related code is scattered all over the place but should be
> consolidated, but I suppose that's a separate project.

Unsure. In its defense, there is also a private murmurhash one-liner within tidbitmap.c. I don't mind changing this, but it's slightly odd to expose a hash function whose only job is to be completely unrelated to hash_any().

> Unnecessary braces here and elsewhere for single line body of for loops.
>
> +bloom_prop_bits_set(bloom_filter *filter)
> +{
> +    int        bitset_bytes = NBITS(filter) / BITS_PER_BYTE;
> +    int64      bits_set = 0;
> +    int        i;
> +
> +    for (i = 0; i < bitset_bytes; i++)
> +    {
> +        unsigned char byte = filter->bitset[i];
> +
> +        while (byte)
> +        {
> +            bits_set++;
> +            byte &= (byte - 1);
> +        }
> +    }

I don't follow what you mean here.

--
Peter Geoghegan
Re: [HACKERS] segment size depending *_wal_size defaults (was increasing the default WAL segment size)
On Wed, Aug 30, 2017 at 9:36 AM, Andres Freund wrote:
> So the question is whether we want {max,min}_wal_size be sized in
> multiples of segment sizes or as a proper byte size. I'm leaning
> towards the latter.

Logically in the code it is just a matter of adjusting multipliers. Do you think that we should worry about wal segment sizes higher than 2GB? Support for int64 GUCs is not here yet.

--
Michael
Re: [HACKERS] [PATCH] Fix drop replication slot blocking instead of returning error
On 29 August 2017 at 22:02, Andres Freund wrote:
> Hi,
>
> On 2017-08-29 13:42:05 +0200, Simone Gotti wrote:
> > On Tue, Aug 29, 2017 at 12:13 PM, Alvaro Herrera wrote:
> >
> > Hi Alvaro,
> >
> > > Simone Gotti wrote:
> > > > Hi all,
> > > >
> > > > I noticed that in postgres 10beta3, calling pg_drop_replication_slot on an
> > > > active slot will block until it's released instead of returning an error
> > > > like done in pg 9.6. Since this is a change in the previous behavior and
> > > > the docs wasn't changed I made a patch to restore the previous behavior.
> > >
> > > Changing that behavior was the entire point of the cited commit.
> >
> > Sorry, I was thinking that the new behavior was needed for internal
> > future functions since the doc wasn't changed.
>
> FWIW, I also don't think it's ok to just change the behaviour
> unconditionally and without a replacement for existing behaviour.

Seems like it just needs a new argument: nowait DEFAULT false.

Craig Ringer
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
[HACKERS] segment size depending *_wal_size defaults (was increasing the default WAL segment size)
Hi,

intentionally breaking the thread here, I want this one point to get a bit wider audience. The excerpt of the relevant discussion is:

On 2017-08-23 12:13:15 +0530, Beena Emerson wrote:
> >> +	/* set default max_wal_size and min_wal_size */
> >> +	snprintf(repltok, sizeof(repltok), "min_wal_size = %s",
> >> +			 pretty_wal_size(DEFAULT_MIN_WAL_SEGS));
> >> +	conflines = replace_token(conflines, "#min_wal_size = 80MB", repltok);
> >> +
> >> +	snprintf(repltok, sizeof(repltok), "max_wal_size = %s",
> >> +			 pretty_wal_size(DEFAULT_MAX_WAL_SEGS));
> >> +	conflines = replace_token(conflines, "#max_wal_size = 1GB", repltok);
> >> +
> >
> > Hm. So postgresql.conf.sample values are now going to contain misleading
> > information for clusters with non-default segment sizes.
> >
> > Have we discussed instead defaulting min_wal_size/max_wal_size to a
> > constant amount of megabytes and rounding up when it doesn't work for
> > a particular segment size?
>
> This was not discussed.
>
> In the original code, the min_wal_size and max_wal_size are computed
> in the guc.c for any wal_segment_size set at configure.
>
> {
> 	{"min_wal_size", PGC_SIGHUP, WAL_CHECKPOINTS,
> 		gettext_noop("Sets the minimum size to shrink the WAL to."),
> 		NULL,
> 		GUC_UNIT_MB
> 	},
> 	&min_wal_size_mb,
> 	5 * (XLOG_SEG_SIZE / (1024 * 1024)), 2, MAX_KILOBYTES,
> 	NULL, NULL, NULL
> },
>
> {
> 	{"max_wal_size", PGC_SIGHUP, WAL_CHECKPOINTS,
> 		gettext_noop("Sets the WAL size that triggers a checkpoint."),
> 		NULL,
> 		GUC_UNIT_MB
> 	},
> 	&max_wal_size_mb,
> 	64 * (XLOG_SEG_SIZE / (1024 * 1024)), 2, MAX_KILOBYTES,
> 	NULL, assign_max_wal_size, NULL
> },
>
> Hence I have retained the same calculation for min_wal_size and
> max_wal_size. If we get consensus for fixing a default and updating
> when required, then I will change the code accordingly.

So the question is whether we want {max,min}_wal_size be sized in multiples of segment sizes or as a proper byte size. I'm leaning towards the latter.
- Andres
Re: [HACKERS] A design for amcheck heapam verification
On Wed, Aug 30, 2017 at 8:34 AM, Thomas Munro wrote:
> It'd be nice to replace both with fls() or flsl(), though it's
> annoying to have to think about long vs int64 etc. We already use
> fls() in two places and supply an implementation in src/port/fls.c for
> platforms that lack it (Windows?), but not the long version.

Yes, you can complain about MSVC compilation here.

--
Michael
Re: [HACKERS] error-handling in ReorderBufferCommit() seems somewhat broken
On 08/30/2017 02:19 AM, Andres Freund wrote:
> On 2017-08-30 02:11:10 +0200, Tomas Vondra wrote:
>>
>> I'm not really following your reasoning. You may very well be right that
>> the BeginInternalSubTransaction() example is not quite valid on postgres
>> core, but I don't see how that implies there can be no other errors in
>> the PG_TRY block. It was merely an explanation how I noticed this issue.
>>
>> To make it absolutely clear, I claim that the PG_CATCH() block is
>> demonstrably broken as it calls AbortCurrentTransaction() and then
>> accesses already freed memory.
>
> Yea, but that's not what happens normally. The txn memory isn't
> allocated in the context created by the started [sub]transaction. I
> think you're just seeing what you're seeing because an error thrown
> before the BeginInternalSubTransaction() succeeds, will roll back the
> *containing* transaction (the one that did the SQL SELECT * FROM
> pg_*logical*() call), and free all the entire reorderbuffer stuff.
>
>> The switch in the PG_TRY() blocks includes multiple elog(ERROR) calls in
>> the switch, and AFAICS hitting any of them will have exactly the same
>> effect as failure in BeginInternalSubTransaction().
>
> No, absolutely not. Once the subtransaction starts, the
> AbortCurrentTransaction() will abort that subtransaction, rather than
> the containing one.

Ah, I see. You're right that elog(ERROR) calls after the subtransaction start are handled correctly and don't trigger a segfault. Sorry for the noise and thanks for the explanation.

regards

--
Tomas Vondra
http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] error-handling in ReorderBufferCommit() seems somewhat broken
On 2017-08-30 02:11:10 +0200, Tomas Vondra wrote: > > > On 08/30/2017 01:34 AM, Andres Freund wrote: > > Hi, > > > > On 2017-08-30 01:27:34 +0200, Tomas Vondra wrote: > >> I've been investigating some failures in test_decoding regression tests, > >> and it seems to me the error-handling in ReorderBufferCommit() is > >> somewhat broken, leading to segfault crashes. > >> > >> The problematic part is this: > >> > >> PG_CATCH() > >> { > >> /* > >> * Force cache invalidation to happen outside of a valid transaction > >> * to prevent catalog access as we just caught an error. > >> */ > >> AbortCurrentTransaction(); > >> > >> /* make sure there's no cache pollution */ > >> ReorderBufferExecuteInvalidations(rb, txn); > >> > >> ... > >> > >> } > >> > >> Triggering it trivial - just add elog(ERROR,...) at the beginning of the > >> PG_TRY() block. > > > > That's not really a valid thing to do, you should put it after the > > BeginInternalSubTransaction()/StartTransactionCommand(). It's basically > > assumed that those won't fail - arguably they should be outside of a > > PG_TRY then, but that's a different matter. If you start decoding > > outside of SQL failing before those will lead to rolling back the parent > > tx... > > > > > >> I suppose this is not quite intentional, but rather an example that > >> error-handling code is an order of magnitude more complicated to write > >> and test. I've only noticed as I'm investigating some regression > >> failures on Postgres-XL 10, which does not support subtransactions and > >> so the BeginInternalSubTransaction() call in the try branch always > >> fails, triggering the issue. > > > > So, IIUC, there's no live problem in postgres core, besides some ugly & > > undocumented assumptions? > > > > I'm not really following your reasoning. You may very well be right that > the BeginInternalSubTransaction() example is not quite valid on postgres > core, but I don't see how that implies there can be no other errors in > the PG_TRY block. 
It was merely an explanation how I noticed this issue. > > To make it absolutely clear, I claim that the PG_CATCH() block is > demonstrably broken as it calls AbortCurrentTransaction() and then > accesses already freed memory. Yea, but that's not what happens normally. The txn memory isn't allocated in the context created by the started [sub]transaction. I think you're just seeing what you're seeing because an error thrown before the BeginInternalSubTransaction() succeeds, will roll back the *containing* transaction (the one that did the SQL SELECT * FROM pg_*logical*() call), and free the entire reorderbuffer stuff. > The switch in the PG_TRY() blocks includes multiple elog(ERROR) calls in > the switch, and AFAICS hitting any of them will have exactly the same > effect as failure in BeginInternalSubTransaction(). No, absolutely not. Once the subtransaction starts, the AbortCurrentTransaction() will abort that subtransaction, rather than the containing one. - Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Tuple-routing for certain partitioned tables not working as expected
On 2017/08/29 20:18, Etsuro Fujita wrote: > On 2017/08/25 22:26, Robert Haas wrote: >> On Wed, Aug 23, 2017 at 4:55 AM, Etsuro Fujita >> wrote: >>> Agreed, but I'd vote for fixing this in v10 as proposed; I agree that just >>> ripping the CheckValidResultRel checks out entirely is not a good idea, >>> but >>> that seems OK to me at least as a fix just for v10. >> >> I'm still not on-board with having this be the one case where we don't >> do CheckValidResultRel. If we want to still call it but pass down >> some additional information that can selectively skip certain checks, >> I could probably live with that. > > Another idea would be to not do CheckValidResultRel for partitions in > ExecSetupPartitionTupleRouting; instead, do that the first time the > partition is chosen by ExecFindPartition, and if successfully checked, > initialize the partition's ResultRelInfo and other stuff. (We could skip > this after the first time by checking whether we already have a valid > ResultRelInfo for the chosen partition.) That could allow us to not only > call CheckValidResultRel the way it is, but avoid initializing useless > partitions for which tuples aren't routed at all. I too have thought about the idea of lazy initialization of the partition ResultRelInfos. I think it would be a good idea, but definitely something for PG 11. Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] error-handling in ReorderBufferCommit() seems somewhat broken
On 08/30/2017 01:34 AM, Andres Freund wrote: > Hi, > > On 2017-08-30 01:27:34 +0200, Tomas Vondra wrote: >> I've been investigating some failures in test_decoding regression tests, >> and it seems to me the error-handling in ReorderBufferCommit() is >> somewhat broken, leading to segfault crashes. >> >> The problematic part is this: >> >> PG_CATCH() >> { >> /* >> * Force cache invalidation to happen outside of a valid transaction >> * to prevent catalog access as we just caught an error. >> */ >> AbortCurrentTransaction(); >> >> /* make sure there's no cache pollution */ >> ReorderBufferExecuteInvalidations(rb, txn); >> >> ... >> >> } >> >> Triggering it trivial - just add elog(ERROR,...) at the beginning of the >> PG_TRY() block. > > That's not really a valid thing to do, you should put it after the > BeginInternalSubTransaction()/StartTransactionCommand(). It's basically > assumed that those won't fail - arguably they should be outside of a > PG_TRY then, but that's a different matter. If you start decoding > outside of SQL failing before those will lead to rolling back the parent > tx... > > >> I suppose this is not quite intentional, but rather an example that >> error-handling code is an order of magnitude more complicated to write >> and test. I've only noticed as I'm investigating some regression >> failures on Postgres-XL 10, which does not support subtransactions and >> so the BeginInternalSubTransaction() call in the try branch always >> fails, triggering the issue. > > So, IIUC, there's no live problem in postgres core, besides some ugly & > undocumented assumptions? > I'm not really following your reasoning. You may very well be right that the BeginInternalSubTransaction() example is not quite valid on postgres core, but I don't see how that implies there can be no other errors in the PG_TRY block. It was merely an explanation how I noticed this issue. 
To make it absolutely clear, I claim that the PG_CATCH() block is demonstrably broken as it calls AbortCurrentTransaction() and then accesses already freed memory. The switch in the PG_TRY() block includes multiple elog(ERROR) calls, and AFAICS hitting any of them will have exactly the same effect as failure in BeginInternalSubTransaction(). And I suppose there are many other ways to trigger an error and get into the catch block. In other words, why would we have the block at all? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] A design for amcheck heapam verification
On Wed, Aug 30, 2017 at 7:58 AM, Peter Geoghegan wrote: > On Thu, May 11, 2017 at 4:30 PM, Peter Geoghegan wrote: >> I spent only a few hours writing a rough prototype, and came up with >> something that does an IndexBuildHeapScan() scan following the >> existing index verification steps. Its amcheck callback does an >> index_form_tuple() call, hashes the resulting IndexTuple (heap TID and >> all), and tests it for membership of a bloom filter generated as part >> of the main B-Tree verification phase. The IndexTuple memory is freed >> immediately following hashing. > > I attach a cleaned-up version of this. It has extensive documentation. > My bloom filter implementation is broken out as a separate patch, > added as core infrastructure under "lib".

Some drive-by comments on the lib patch:

+bloom_add_element(bloom_filter *filter, unsigned char *elem, Size len)

I think the plan is to use size_t for new stuff[1].

+/*
+ * Which element of the sequence of powers-of-two is less than or equal to n?
+ *
+ * Used to size bitset, which in practice is never allowed to exceed 2 ^ 30
+ * bits (128MB). This frees us from giving further consideration to int
+ * overflow.
+ */
+static int
+pow2_truncate(int64 target_bitset_size)
+{
+	int		v = 0;
+
+	while (target_bitset_size > 0)
+	{
+		v++;
+		target_bitset_size = target_bitset_size >> 1;
+	}
+
+	return Min(v - 1, 30);
+}

This is another my_log2(), right? It'd be nice to replace both with fls() or flsl(), though it's annoying to have to think about long vs int64 etc. We already use fls() in two places and supply an implementation in src/port/fls.c for platforms that lack it (Windows?), but not the long version.

+/*
+ * Hash function is taken from sdbm, a public-domain reimplementation of the
+ * ndbm database library.
+ */
+static uint32
+sdbmhash(unsigned char *elem, Size len)
+{
+	uint32		hash = 0;
+	int			i;
+
+	for (i = 0; i < len; elem++, i++)
+	{
+		hash = (*elem) + (hash << 6) + (hash << 16) - hash;
+	}
+
+	return hash;
+}

I see that this is used in gawk, BerkeleyDB and all over the place[2]. Nice. I understand that the point of this is to be a hash function that is different from our usual one, for use by k_hashes(). Do you think it belongs somewhere more common than this? It seems a bit like our hash-related code is scattered all over the place but should be consolidated, but I suppose that's a separate project.

Unnecessary braces here and elsewhere for single line body of for loops.

+bloom_prop_bits_set(bloom_filter *filter)
+{
+	int			bitset_bytes = NBITS(filter) / BITS_PER_BYTE;
+	int64		bits_set = 0;
+	int			i;
+
+	for (i = 0; i < bitset_bytes; i++)
+	{
+		unsigned char byte = filter->bitset[i];
+
+		while (byte)
+		{
+			bits_set++;
+			byte &= (byte - 1);
+		}
+	}

Sorry I didn't follow up with my threat[3] to provide a central popcount() function to replace the implementations all over the tree.

[1] https://www.postgresql.org/message-id/25076.1489699457%40sss.pgh.pa.us
[2] http://www.cse.yorku.ca/~oz/hash.html
[3] https://www.postgresql.org/message-id/CAEepm%3D3g1_fjJGp38QGv%2B38BC2HHVkzUq6s69nk3mWLgPHqC3A%40mail.gmail.com

-- Thomas Munro http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] error-handling in ReorderBufferCommit() seems somewhat broken
Hi, On 2017-08-30 01:27:34 +0200, Tomas Vondra wrote: > I've been investigating some failures in test_decoding regression tests, > and it seems to me the error-handling in ReorderBufferCommit() is > somewhat broken, leading to segfault crashes. > > The problematic part is this: > > PG_CATCH() > { > /* > * Force cache invalidation to happen outside of a valid transaction > * to prevent catalog access as we just caught an error. > */ > AbortCurrentTransaction(); > > /* make sure there's no cache pollution */ > ReorderBufferExecuteInvalidations(rb, txn); > > ... > > } > > Triggering it trivial - just add elog(ERROR,...) at the beginning of the > PG_TRY() block. That's not really a valid thing to do, you should put it after the BeginInternalSubTransaction()/StartTransactionCommand(). It's basically assumed that those won't fail - arguably they should be outside of a PG_TRY then, but that's a different matter. If you start decoding outside of SQL failing before those will lead to rolling back the parent tx... > I suppose this is not quite intentional, but rather an example that > error-handling code is an order of magnitude more complicated to write > and test. I've only noticed as I'm investigating some regression > failures on Postgres-XL 10, which does not support subtransactions and > so the BeginInternalSubTransaction() call in the try branch always > fails, triggering the issue. So, IIUC, there's no live problem in postgres core, besides some ugly & undocumented assumptions? Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] error-handling in ReorderBufferCommit() seems somewhat broken
Hi, I've been investigating some failures in test_decoding regression tests, and it seems to me the error-handling in ReorderBufferCommit() is somewhat broken, leading to segfault crashes. The problematic part is this: PG_CATCH() { /* * Force cache invalidation to happen outside of a valid transaction * to prevent catalog access as we just caught an error. */ AbortCurrentTransaction(); /* make sure there's no cache pollution */ ReorderBufferExecuteInvalidations(rb, txn); ... } Triggering it is trivial - just add elog(ERROR,...) at the beginning of the PG_TRY() block. The problem is that AbortCurrentTransaction() apparently releases the memory where txn is allocated, making it entirely bogus. So in assert builds txn->ivalidations are 0x7f7f7f7f7f7f7f7f, triggering a segfault. Similar issues apply to subsequent calls in the catch block, that also use txn in some way (e.g. through snapshot_now). I suppose this is not quite intentional, but rather an example that error-handling code is an order of magnitude more complicated to write and test. I've only noticed as I'm investigating some regression failures on Postgres-XL 10, which does not support subtransactions and so the BeginInternalSubTransaction() call in the try branch always fails, triggering the issue. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] increasing the default WAL segment size
Hi, On 2017-08-23 12:13:15 +0530, Beena Emerson wrote: > >> + /* > >> + * The calculation of XLOGbuffers requires the run-time > >> parameter > >> + * XLogSegSize which is set from the control file. This > >> value is > >> + * required to create the shared memory segment. Hence, > >> temporarily > >> + * allocate space for reading the control file. > >> + */ > > > > This makes me uncomfortable. Having to choose the control file multiple > > times seems wrong. We're effectively treating the control file as part > > of the configuration now, and that means we should move it's parsing to > > an earlier part of startup. > > Yes, this may seem ugly. ControlFile was originally read into the > shared memory segment but then we now need the XLogSegSize from the > ControlFile to initialise the shared memory segment. I could not > figure out any other way to achieve this. I think reading it once into local memory inside the startup process and then copying it into shared memory from there should work? > >> @@ -8146,6 +8181,9 @@ InitXLOGAccess(void) > >> ThisTimeLineID = XLogCtl->ThisTimeLineID; > >> Assert(ThisTimeLineID != 0 || IsBootstrapProcessingMode()); > >> > >> + /* set XLogSegSize */ > >> + XLogSegSize = ControlFile->xlog_seg_size; > >> + > > > > Hm, why do we have two variables keeping track of the segment size? > > wal_segment_size and XLogSegSize? That's bound to lead to confusion. > > > > wal_segment_size is the guc which stores the number of segments > (XLogSegSize / XLOG_BLCKSZ). wal_segment_size and XLogSegSize are the same name, spelt differently, so if that's where we want to go, we should name them differently. But perhaps more fundamentally, I don't see why we need both: What stops us from just defining the GUC in bytes? Regards, Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Improve history file reasons when doing promotion
Hi all, When triggering a promotion, the history file generated for the timeline bump provides a reason for the bump that is not really helpful: $ cat 0002.history 1 0/3000858 no recovery target specified I was wondering if we could improve that a bit for a promotion. One thing popping up immediately to my mind would be to check for IsPromoteTriggered() and replace this reason by "standby promotion". Any thoughts perhaps? -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Make pg_regress print a connstring with sockdir
On Mon, Aug 28, 2017 at 7:57 AM, Craig Ringer wrote: > I'm frequently debugging postmasters that are around long enough. Deadlocks, > etc. > > It's also way easier to debug shmem related issues with a live postmaster vs > a core. Yeah. I don't *frequently* debug postmasters that hang during the regression tests, but I definitely have done it, and I think something like this would make it easier. Right now if something wedges and you need to connect to the postmaster to see what's going on, you have to grep for the pid, then lsof to get the socket directory. This would simplify things. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] postgres_fdw bug in 9.6
Etsuro Fujita writes: > [ epqpath-for-foreignjoin-11.patch ] I started looking at this. I've not yet wrapped my head around what CreateLocalJoinPath() is doing, but I noted that Robert's concerns about ABI breakage in the back branches would not be very difficult to satisfy. What we'd need to do is (1) In the back-branch patch, add the new fields of JoinPathExtraData at the end, rather than in their more natural locations. This should avoid any ABI break for uses of that struct, since there's no reason why an FDW would be creating its own variables of that type rather than just using what the core code passes it. (2) In the back branches, just leave GetExistingLocalJoinPath in place rather than deleting it. Existing FDWs would still be subject to the bug until they were updated, but given how hard it is to trigger, that doesn't seem like a huge problem. A variant of (2) is to install a hack fix in GetExistingLocalJoinPath rather than leaving it unchanged. I'm not very excited about that though; it doesn't seem unlikely that we might introduce new bugs that way, and it would certainly be a distraction from getting the real fix finished. We'd definitely need to do things that way in 9.6. I'm not quite sure whether it's too late to adopt the clean solution in v10. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Tue, Jul 4, 2017 at 12:33 AM, Amit Kapila wrote: > I have updated the patch to support wait events and moved it to upcoming CF. This patch doesn't apply any more, but I made it apply with a hammer and then did a little benchmarking (scylla, EDB server, Intel Xeon E5-2695 v3 @ 2.30GHz, 2 sockets, 14 cores/socket, 2 threads/core). The results were not impressive. There's basically no clog contention to remove, so the patch just doesn't really do anything. For example, here's a wait event profile with master and using Ashutosh's test script with 5 savepoints:

      1  Lock    | tuple
      2  IO      | SLRUSync
      5  LWLock  | wal_insert
      5  LWLock  | XidGenLock
      9  IO      | DataFileRead
     12  LWLock  | lock_manager
     16  IO      | SLRURead
     20  LWLock  | CLogControlLock
     97  LWLock  | buffer_content
    216  Lock    | transactionid
    237  LWLock  | ProcArrayLock
   1238  IPC     | ProcArrayGroupUpdate
   2266  Client  | ClientRead

This is just a 5-minute test; maybe things would change if we ran it for longer, but if only 0.5% of the samples are blocked on CLogControlLock without the patch, obviously the patch can't help much. I did some other experiments too, but I won't bother summarizing the results here because they're basically boring. I guess I should have used a bigger machine. Given that we've changed the approach here somewhat, I think we need to validate that we're still seeing a substantial reduction in CLogControlLock contention on big machines. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: Improve bitmap costing for lossy pages
On Tue, Aug 29, 2017 at 1:08 AM, Dilip Kumar wrote:
> (Time in ms)
> Query    head      patch
>
> 6        23819     14571
> 14       13514     11183
> 15       49980     32400
> 20       204441    188978

These are cool results, but this patch is obviously not ready for prime time as-is, since there are various other references that will need to be updated:

 * Since we are called as soon as nentries exceeds maxentries, we should
 * push nentries down to significantly less than maxentries, or else we'll
 * just end up doing this again very soon. We shoot for maxentries/2.

/*
 * With a big bitmap and small work_mem, it's possible that we cannot get
 * under maxentries. Again, if that happens, we'd end up uselessly
 * calling tbm_lossify over and over. To prevent this from becoming a
 * performance sink, force maxentries up to at least double the current
 * number of entries. (In essence, we're admitting inability to fit
 * within work_mem when we do this.) Note that this test will not fire if
 * we broke out of the loop early; and if we didn't, the current number of
 * entries is simply not reducible any further.
 */
if (tbm->nentries > tbm->maxentries / 2)
    tbm->maxentries = Min(tbm->nentries, (INT_MAX - 1) / 2) * 2;

I suggest defining a TBM_FILLFACTOR constant instead of repeating the value 0.9 in a bunch of places. I think it would also be good to try to find the sweet spot for that constant. Making it bigger reduces the number of lossy entries created, but making it smaller reduces the number of times we have to walk the bitmap. So if, for example, 0.75 is sufficient to produce almost all of the gain, then I think we would want to prefer 0.75 to 0.9. But if 0.9 is better, then we can stick with that. Note that a value higher than 0.9375 wouldn't be sane without some additional safety precautions because maxentries could be as low as 16.
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Improving overflow checks when adding tuple to PGresult Re: [GENERAL] Retrieving query results
On Wed, Aug 30, 2017 at 4:24 AM, Tom Lane wrote: > Michael Paquier writes: >> On Mon, Aug 28, 2017 at 3:05 PM, Michael Paquier >> wrote: >>> Attached are two patches: >>> 1) 0001 refactors the code around pqAddTuple to be able to handle >>> error messages and assign them in PQsetvalue particularly. >>> 2) 0002 adds sanity checks in pqAddTuple for overflows, maximizing the >>> size of what is allocated to INT_MAX but not more. > > I've pushed these (as one commit) with some adjustments. Thanks! > Mainly, > I changed PQsetvalue to report failure messages with PQinternalNotice, > which is what already happens inside check_field_number() for the case > of an out-of-range field number. It's possible that storing the > message into the PGresult in addition would be worth doing, but I'm > unconvinced about that --- we certainly haven't had any field requests > for it. OK, fine for me. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] A design for amcheck heapam verification
On Thu, May 11, 2017 at 4:30 PM, Peter Geoghegan wrote: > I spent only a few hours writing a rough prototype, and came up with > something that does an IndexBuildHeapScan() scan following the > existing index verification steps. Its amcheck callback does an > index_form_tuple() call, hashes the resulting IndexTuple (heap TID and > all), and tests it for membership of a bloom filter generated as part > of the main B-Tree verification phase. The IndexTuple memory is freed > immediately following hashing. I attach a cleaned-up version of this. It has extensive documentation. My bloom filter implementation is broken out as a separate patch, added as core infrastructure under "lib". I do have some outstanding concerns about V1 of the patch series: * I'm still uncertain about the question of using IndexBuildHeapScan() during Hot Standby. It seems safe, since we're only using the CONCURRENTLY/AccessShareLock path when this happens, but someone might find that objectionable on general principle. For now, in this first version, it remains possible to call IndexBuildHeapScan() during Hot Standby, to allow the new verification to work there. * The bloom filter has been experimentally verified, and is based on source material which is directly cited. It would nevertheless be useful to have the hashing stuff scrutinized, because it's possible that I've overlooked some subtlety. This is only the beginning for heapam verification. Comprehensive coverage can be added later, within routines that specifically target some table, not some index. While this patch series only adds index-to-heap verification for B-Tree indexes, I can imagine someone adopting the same technique to verifying that other access methods are consistent with their heap relation. For example, it would be easy to do this with hash indexes. Any other index access method where the same high-level principle that I rely on applies can do index-to-heap verification with just a few tweaks.
I'm referring to the high-level principle that comments specifically point out in the patch: that REINDEX leaves you with an index structure that has exactly the same entries as the old index structure had, though possibly with fewer dead index tuples. I like my idea of reusing IndexBuildHeapScan() for verification. Very few new LOCs are actually added to amcheck by this patch, and IndexBuildHeapScan() is itself tested. -- Peter Geoghegan

From 48499cfb58b7bf705e93fb12cc5359ec12cd9c51 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan
Date: Tue, 2 May 2017 00:19:24 -0700
Subject: [PATCH 2/2] Add amcheck verification of indexes against heap.

Add a new, optional capability to bt_index_check() and bt_index_parent_check(): callers can check that each heap tuple that ought to have an index entry does in fact have one. This happens at the end of the existing verification checks.

This is implemented by using a bloom filter data structure. The implementation performs set membership tests within a callback (the same type of callback that each index AM registers for CREATE INDEX). The bloom filter is populated during the initial index verification scan.
---
 contrib/amcheck/Makefile                 |   2 +-
 contrib/amcheck/amcheck--1.0--1.1.sql    |  28 
 contrib/amcheck/amcheck.control          |   2 +-
 contrib/amcheck/expected/check_btree.out |  14 +-
 contrib/amcheck/sql/check_btree.sql      |   9 +-
 contrib/amcheck/verify_nbtree.c          | 236 ---
 doc/src/sgml/amcheck.sgml                | 103 +++---
 7 files changed, 345 insertions(+), 49 deletions(-)
 create mode 100644 contrib/amcheck/amcheck--1.0--1.1.sql

diff --git a/contrib/amcheck/Makefile b/contrib/amcheck/Makefile
index 43bed91..c5764b5 100644
--- a/contrib/amcheck/Makefile
+++ b/contrib/amcheck/Makefile
@@ -4,7 +4,7 @@ MODULE_big = amcheck
 OBJS = verify_nbtree.o $(WIN32RES)
 
 EXTENSION = amcheck
-DATA = amcheck--1.0.sql
+DATA = amcheck--1.0--1.1.sql amcheck--1.0.sql
 PGFILEDESC = "amcheck - function for verifying relation integrity"
 
 REGRESS = check check_btree
diff --git a/contrib/amcheck/amcheck--1.0--1.1.sql b/contrib/amcheck/amcheck--1.0--1.1.sql
new file mode 100644
index 000..e6cca0a
--- /dev/null
+++ b/contrib/amcheck/amcheck--1.0--1.1.sql
@@ -0,0 +1,28 @@
+/* contrib/amcheck/amcheck--1.0--1.1.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "ALTER EXTENSION amcheck UPDATE TO '1.1'" to load this file. \quit
+
+--
+-- bt_index_check()
+--
+DROP FUNCTION bt_index_check(regclass);
+CREATE FUNCTION bt_index_check(index regclass,
+    heapallindexed boolean DEFAULT false)
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'bt_index_check'
+LANGUAGE C STRICT PARALLEL RESTRICTED;
+
+--
+-- bt_index_parent_check()
+--
+DROP FUNCTION bt_index_parent_check(regclass);
+CREATE FUNCTION bt_index_parent_check(index regclass,
+    heapallindexed boolean DEFAULT false)
+RETURNS VOID
+AS
Re: [HACKERS] Improving overflow checks when adding tuple to PGresult Re: [GENERAL] Retrieving query results
Michael Paquier writes: > On Mon, Aug 28, 2017 at 3:05 PM, Michael Paquier > wrote: >> Attached are two patches: >> 1) 0001 refactors the code around pqAddTuple to be able to handle >> error messages and assign them in PQsetvalue particularly. >> 2) 0002 adds sanity checks in pqAddTuple for overflows, maximizing the >> size of what is allocated to INT_MAX but not more. I've pushed these (as one commit) with some adjustments. Mainly, I changed PQsetvalue to report failure messages with PQinternalNotice, which is what already happens inside check_field_number() for the case of an out-of-range field number. It's possible that storing the message into the PGresult in addition would be worth doing, but I'm unconvinced about that --- we certainly haven't had any field requests for it. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hash Functions
On Tue, Aug 22, 2017 at 8:14 AM, amul sul wrote: > Attaching patch 0002 for the reviewer's testing. I think that this 0002 is not something we can think of committing because there's no guarantee that hash functions will return the same results on all platforms. However, what we could and, I think, should do is hash some representative values of each data type and verify that hash(x) and hashextended(x, 0) come out the same at least as to the low-order 32 bits -- and maybe that hashextend(x, 1) comes out different as to the low-order 32 bits. The easiest way to do this seems to be to cast to bit(32). For example:

SELECT v, hashint4(v)::bit(32) as standard,
       hashint4extended(v, 0)::bit(32) as extended0,
       hashint4extended(v, 1)::bit(32) as extended1
FROM (VALUES (0), (1), (17), (42), (550273), (207112489)) x(v)
WHERE hashint4(v)::bit(32) != hashint4extended(v, 0)::bit(32)
   OR hashint4(v)::bit(32) = hashint4extended(v, 1)::bit(32);

I suggest adding a version of this for each data type. From your latest version of 0001, I get: rangetypes.c:1297:8: error: unused variable 'rngtypid' [-Werror,-Wunused-variable] Oid rngtypid = RangeTypeGetOid(r); I suggest not duplicating the comments from each regular function into the extended function, but just writing /* Same approach as hashfloat8 */ when the implementation is non-trivial (no need for this if the extended function is a single line or the original function had no comments anyway). hash_aclitem seems embarrassingly stupid. I suggest that we make the extended version slightly less stupid -- i.e. if the seed is non-zero, actually call hash_any_extended on the sum and pass the seed through.

 * Reset info about hash function whenever we pick up new info about
 * equality operator. This is so we can ensure that the hash function
 * matches the operator.
*/ typentry->flags &= ~(TCFLAGS_CHECKED_HASH_PROC); +typentry->flags &= ~(TCFLAGS_CHECKED_HASH_EXTENDED_PROC); Adjust comment: "about hash function" -> "about hash functions", "hash functions matches" -> "hash functions match". +extern Datum +hash_any_extended(register const unsigned char *k, register int + keylen, uint64 seed); Formatting. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
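The suggested hash_aclitem fix, calling hash_any_extended on the sum and passing the seed through only when the seed is non-zero, keeps the zero-seed result bit-identical to the legacy hash, which is exactly what the SQL probe above verifies. A hedged C sketch of that contract (stand-in 64-bit mixer, not the real hash_any_extended):

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in 64-bit mixer; PostgreSQL's hash_any_extended is more involved. */
static uint64_t
mix64(uint64_t x)
{
    x ^= x >> 33;
    x *= UINT64_C(0xff51afd7ed558ccd);
    x ^= x >> 33;
    return x;
}

/* Sketch of the proposed hash_aclitem_extended contract: with a zero seed
 * the result matches the legacy hash (here, the plain sum), and a non-zero
 * seed actually participates via the extended hash. */
static uint64_t
hash_aclitem_extended_sketch(uint32_t sum, uint64_t seed)
{
    if (seed == 0)
        return sum;                 /* backward-compatible path */
    return mix64(sum ^ seed);       /* seed perturbs the result */
}
```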
Re: [HACKERS] [BUGS] [postgresql 10 beta3] unrecognized node type: 90
Amit Kapila writes: > On Tue, Aug 29, 2017 at 8:32 AM, Robert Haas wrote: >> There's already ExecParallelReinitialize, which could be made to walk >> the nodes in addition to what it does already, but I don't understand >> exactly what else needs fixing. > Sure, but it is not advisable to reset state of all the nodes below > gather at that place, otherwise, it will be more or less like we are > forcing rescan of each node. I think there we can reset the shared > parallel state of parallel-aware nodes (or anything related) and then > allow rescan to reset the master backend specific state for all nodes > beneath gather. Right, the idea is to make this happen separately from the "rescan" logic. In general, it's a good idea for ExecReScanFoo to do as little as possible, so that you don't pay if a node is rescanned more than once before it's asked to do anything, or indeed if no rows are ever demanded from it at all. Attached is a WIP patch along this line. It's unfinished because I've not done the tedium of extending the FDW and CustomScan APIs to support this new type of per-node operation; but that part seems straightforward enough. The core code is complete and survived light testing. I'm pretty happy with the results --- note in particular how we get rid of some very dubious coding in ExecReScanIndexScan and ExecReScanIndexOnlyScan. If you try the test case from a2b70c89c on this patch alone, you'll notice that instead of sometimes reporting too-small counts during the rescans, it pretty consistently reports too-large counts. This is because we are now successfully resetting the shared state for the parallel seqscan, but we haven't done anything about the leader's HashAgg node deciding that it can re-use its old hashtable. So on the first scan, the leader typically scans all or most of the table because of its startup time advantage, and saves those counts in its hashtable.
On later scans, the workers read all of the table while the leader decides it need do no scanning. So we get counts that reflect all of the table (from the workers) plus whatever part of the table the leader read the first time. So this by no means removes the need for my other patch. If no objections, I'll do the additional legwork and push. As before, I think we can probably get away without fixing 9.6, even though it's nominally got the same bug. regards, tom lane diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index ff03c68..e29c5ad 100644 *** a/src/backend/access/heap/heapam.c --- b/src/backend/access/heap/heapam.c *** heap_rescan(HeapScanDesc scan, *** 1531,1551 * reinitialize scan descriptor */ initscan(scan, key, true); - - /* - * reset parallel scan, if present - */ - if (scan->rs_parallel != NULL) - { - ParallelHeapScanDesc parallel_scan; - - /* - * Caller is responsible for making sure that all workers have - * finished the scan before calling this. - */ - parallel_scan = scan->rs_parallel; - pg_atomic_write_u64(&parallel_scan->phs_nallocated, 0); - } } /* --- 1531,1536 *** heap_parallelscan_initialize(ParallelHea *** 1643,1648 --- 1628,1646 } /* + * heap_parallelscan_reinitialize - reset a parallel scan + * + * Call this in the leader process. Caller is responsible for + * making sure that all workers have finished the scan beforehand. + * + */ + void + heap_parallelscan_reinitialize(ParallelHeapScanDesc parallel_scan) + { + pg_atomic_write_u64(&parallel_scan->phs_nallocated, 0); + } + + /* * heap_beginscan_parallel - join a parallel scan * * Caller must hold a suitable lock on the correct relation.
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c index ce47f1d..d8cdb0e 100644 *** a/src/backend/executor/execParallel.c --- b/src/backend/executor/execParallel.c *** static bool ExecParallelInitializeDSM(Pl *** 109,114 --- 109,116 ExecParallelInitializeDSMContext *d); static shm_mq_handle **ExecParallelSetupTupleQueues(ParallelContext *pcxt, bool reinitialize); + static bool ExecParallelReInitializeDSM(PlanState *planstate, + ParallelContext *pcxt); static bool ExecParallelRetrieveInstrumentation(PlanState *planstate, SharedExecutorInstrumentation *instrumentation); *** ExecParallelSetupTupleQueues(ParallelCon *** 365,382 } /* - * Re-initialize the parallel executor info such that it can be reused by - * workers. - */ - void - ExecParallelReinitialize(ParallelExecutorInfo *pei) - { - ReinitializeParallelDSM(pei->pcxt); - pei->tqueue = ExecParallelSetupTupleQueues(pei->pcxt, true); - pei->finished = false; - } - - /* * Sets up the required infrastructure for backend workers to perform * execution and
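The shared-state reset that moves into heap_parallelscan_reinitialize is a single atomic store rewinding the block-allocation counter, done only by the leader once all workers are known to be finished. A standalone sketch using C11 atomics in place of PostgreSQL's pg_atomic layer:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Minimal stand-in for ParallelHeapScanDesc: workers atomically claim
 * block numbers from phs_nallocated until the relation is exhausted. */
typedef struct
{
    uint64_t nblocks;
    _Atomic uint64_t phs_nallocated;
} ParallelScanSketch;

/* Worker side: claim the next block, or UINT64_MAX when the scan is done. */
static uint64_t
scan_next_block(ParallelScanSketch *scan)
{
    uint64_t next = atomic_fetch_add(&scan->phs_nallocated, 1);
    return next < scan->nblocks ? next : UINT64_MAX;
}

/* Leader side, between rescans: only safe once all workers are finished,
 * which is why this lives in the reinitialize path driven by
 * ExecParallelReinitialize rather than in a per-node ReScan callback. */
static void
scan_reinitialize(ParallelScanSketch *scan)
{
    atomic_store(&scan->phs_nallocated, 0);
}
```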
[HACKERS] OpenFile() Permissions Refactor
Hackers, While working on the patch to allow group reads of $PGDATA I refactored the various OpenFile() functions to use default/global permissions rather than permissions defined in each call. I think the patch stands on its own regardless of whether we accept the patch to allow group permissions (which won't make this CF). We had a couple different ways of defining permissions (e.g. 0600 vs S_IRUSR | S_IWUSR) and in any case it represented quite a bit of duplication. This way seems simpler and less error-prone. I have intentionally not touched the open/fopen/mkdir calls in these files since they are specialized and require per-instance consideration (if they are changed at all): backend/postmaster/fork_process.c backend/postmaster/postmaster.c backend/utils/error/elog.c backend/utils/init/miscinit.c All tests pass but it's possible that I've missed something or changed something that shouldn't be changed. I'll submit the patch to the 2017-09 CF. Thanks, -- -David da...@pgmasters.net diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c index fa409d72b7..3ab1fd2db4 100644 --- a/contrib/pg_stat_statements/pg_stat_statements.c +++ b/contrib/pg_stat_statements/pg_stat_statements.c @@ -1869,8 +1869,7 @@ qtext_store(const char *query, int query_len, *query_offset = off; /* Now write the data into the successfully-reserved part of the file */ - fd = OpenTransientFile(PGSS_TEXT_FILE, O_RDWR | O_CREAT | PG_BINARY, - S_IRUSR | S_IWUSR); + fd = OpenTransientFile(PGSS_TEXT_FILE, O_RDWR | O_CREAT | PG_BINARY); if (fd < 0) goto error; @@ -1934,7 +1933,7 @@ qtext_load_file(Size *buffer_size) int fd; struct stat stat; - fd = OpenTransientFile(PGSS_TEXT_FILE, O_RDONLY | PG_BINARY, 0); + fd = OpenTransientFile(PGSS_TEXT_FILE, O_RDONLY | PG_BINARY); if (fd < 0) { if (errno != ENOENT) diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c index bd560e47e1..f93c194e18 100644 --- 
a/src/backend/access/heap/rewriteheap.c +++ b/src/backend/access/heap/rewriteheap.c @@ -1013,8 +1013,7 @@ logical_rewrite_log_mapping(RewriteState state, TransactionId xid, src->off = 0; memcpy(src->path, path, sizeof(path)); src->vfd = PathNameOpenFile(path, - O_CREAT | O_EXCL | O_WRONLY | PG_BINARY, - S_IRUSR | S_IWUSR); + O_CREAT | O_EXCL | O_WRONLY | PG_BINARY); if (src->vfd < 0) ereport(ERROR, (errcode_for_file_access(), @@ -1133,8 +1132,7 @@ heap_xlog_logical_rewrite(XLogReaderState *r) xlrec->mapped_xid, XLogRecGetXid(r)); fd = OpenTransientFile(path, - O_CREAT | O_WRONLY | PG_BINARY, - S_IRUSR | S_IWUSR); + O_CREAT | O_WRONLY | PG_BINARY); if (fd < 0) ereport(ERROR, (errcode_for_file_access(), @@ -1258,7 +1256,7 @@ CheckPointLogicalRewriteHeap(void) } else { - int fd = OpenTransientFile(path, O_RDONLY | PG_BINARY, 0); + int fd = OpenTransientFile(path, O_RDONLY | PG_BINARY); /* * The file cannot vanish due to concurrency since this function diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c index 77edc51e1c..9dd77190ec 100644 --- a/src/backend/access/transam/slru.c +++ b/src/backend/access/transam/slru.c @@ -599,7 +599,7 @@ SimpleLruDoesPhysicalPageExist(SlruCtl ctl, int pageno) SlruFileName(ctl, path, segno); - fd = OpenTransientFile(path, O_RDWR | PG_BINARY, S_IRUSR | S_IWUSR); + fd = OpenTransientFile(path, O_RDWR | PG_BINARY); if (fd < 0) { /* expected: file doesn't exist */ @@ -654,7 +654,7 @@ SlruPhysicalReadPage(SlruCtl ctl, int pageno, int slotno) * SlruPhysicalWritePage). Hence, if we are InRecovery, allow the case * where the file doesn't exist, and return zeroes instead. */ - fd = OpenTransientFile(path, O_RDWR | PG_BINARY, S_IRUSR | S_IWUSR); + fd = OpenTransientFile(path, O_RDWR | PG_BINARY); if (fd < 0) { if (errno != ENOENT || !InRecovery) @@ -804,8 +804,7 @@ SlruPhysicalWritePage(SlruCtl ctl, int pageno, int slotno, SlruFlush
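The refactor collapses the per-call-site modes (0600 here, S_IRUSR | S_IWUSR there) into a single default applied inside the wrapper, so callers pass only path and flags. A minimal sketch of the resulting shape (illustrative names; the actual patch changes OpenTransientFile, PathNameOpenFile, and friends):

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* One shared default instead of S_IRUSR | S_IWUSR (or 0600) repeated at
 * every call site; a later group-read patch could widen this in one place. */
static const mode_t PG_FILE_MODE_DEFAULT = S_IRUSR | S_IWUSR;

/* Sketch of the two-argument form the patch moves callers to:
 * the mode is no longer passed per call. */
static int
open_transient_file_sketch(const char *path, int flags)
{
    return open(path, flags, PG_FILE_MODE_DEFAULT);
}
```

Centralizing the mode also removes the class of bugs where one call site passes 0 (as the old `OpenTransientFile(path, O_RDONLY | PG_BINARY, 0)` calls did) and another passes 0600 for the same kind of file.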
Re: [HACKERS] Explicit relation name in VACUUM VERBOSE log
On Tue, Aug 22, 2017 at 2:23 AM, Simon Riggs wrote: > Yes, we can. I'm not sure why you would do this only for VACUUM > though? I see many messages in various places that need the same treatment I'm skeptical about the idea of doing this too generally. rhaas=> select * from foo; ERROR: permission denied for relation foo Do we really want to start saying public.foo in all such error messages? To me, that's occasionally helpful but more often just useless chatter. One problem with making this behavior optional is that we'd then need two separate translatable strings in every case, one saying "table %s has problems" and the other saying "table %s.%s has problems". Maybe we could avoid that via some clever trick, but that's how we're doing it in some existing cases. I have a feeling we're making a small patch with a narrow goal into a giant patch for which everyone has a different goal. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
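The translatable-strings objection can be made concrete: an optional schema qualifier forces two distinct format strings per message, because one string cannot serve both forms without breaking translation. A hedged C sketch of the resulting pattern (hypothetical message builder, not an actual PostgreSQL error site):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Without gettext available here, _() is a pass-through; in the backend it
 * marks the string for translation, which is why each distinct format
 * string doubles the translators' workload. */
#define _(x) (x)

/* Build the error text with or without the schema, using two separate
 * translatable format strings rather than gluing "%s." onto one of them. */
static int
permission_denied_msg(char *buf, size_t len, int qualify,
                      const char *nspname, const char *relname)
{
    if (qualify)
        return snprintf(buf, len, _("permission denied for relation %s.%s"),
                        nspname, relname);
    return snprintf(buf, len, _("permission denied for relation %s"),
                    relname);
}
```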
Re: [HACKERS] [PATCH] Fix drop replication slot blocking instead of returning error
Hi, On 2017-08-29 13:42:05 +0200, Simone Gotti wrote: > On Tue, Aug 29, 2017 at 12:13 PM, Alvaro Herrera wrote: > > > > Hi Alvaro, > > > Simone Gotti wrote: > > > Hi all, > > > > > > I noticed that in postgres 10beta3, calling pg_drop_replication_slot on an > > > active slot will block until it's released instead of returning an error > > > like > > > done in pg 9.6. Since this is a change in the previous behavior and the > > > docs > > > wasn't changed I made a patch to restore the previous behavior. > > > > Changing that behavior was the entire point of the cited commit. > > Sorry, I was thinking that the new behavior was needed for internal > future functions since the doc wasn't changed. FWIW, I also don't think it's ok to just change the behaviour unconditionally and without a replacement for existing behaviour. Greetings, Andres Freund
Re: [HACKERS] Update low-level backup documentation to match actual behavior
Hi Robert, On 8/25/17 4:03 PM, David Steele wrote: > On 8/25/17 3:26 PM, Robert Haas wrote: >> On Fri, Aug 25, 2017 at 3:21 PM, David Steele >> wrote: >>> No problem. I'll base it on your commit to capture any changes you >>> made. >> >> Thanks, but you incorporated everything I wanted in response to my >> first review -- so I didn't tweak it any further. > > Thank you for committing that. I'll get the 9.6 patch to you early next > week. Attached is the 9.6 patch. It required a bit more work in func.sgml than I was expecting so have a close look at that. The rest was mostly removing irrelevant hunks. Thanks, -- -David da...@pgmasters.net diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml index 03c0dbf1cd..456f6f2c98 100644 --- a/doc/src/sgml/backup.sgml +++ b/doc/src/sgml/backup.sgml @@ -889,8 +889,11 @@ SELECT pg_start_backup('label', false, false); SELECT * FROM pg_stop_backup(false); - This terminates the backup mode and performs an automatic switch to - the next WAL segment. The reason for the switch is to arrange for + This terminates backup mode. On a primary, it also performs an automatic + switch to the next WAL segment. On a standby, it is not possible to + automatically switch WAL segments, so you may wish to run + pg_switch_wal on the primary to perform a manual + switch. The reason for the switch is to arrange for the last WAL segment file written during the backup interval to be ready to archive. @@ -908,7 +911,7 @@ SELECT * FROM pg_stop_backup(false); Once the WAL segment files active during the backup are archived, you are done. The file identified by pg_stop_backup's first return value is the last segment that is required to form a complete set of - backup files. If archive_mode is enabled, + backup files. On a primary, if archive_mode is enabled, pg_stop_backup does not return until the last segment has been archived.
Archiving of these files happens automatically since you have @@ -924,6 +927,13 @@ SELECT * FROM pg_stop_backup(false); pg_stop_backup terminates because of this your backup may not be valid. + + + Note that on a standby pg_stop_backup does not wait for + WAL segments to be archived so the backup process must ensure that all WAL + segments required for the backup are successfully archived. + + @@ -932,9 +942,9 @@ SELECT * FROM pg_stop_backup(false); Making an exclusive low level backup The process for an exclusive backup is mostly the same as for a - non-exclusive one, but it differs in a few key steps. It does not allow - more than one concurrent backup to run, and there can be some issues on - the server if it crashes during the backup. Prior to PostgreSQL 9.6, this + non-exclusive one, but it differs in a few key steps. This type of backup + can only be taken on a primary and does not allow concurrent backups. + Prior to PostgreSQL 9.6, this was the only low-level method available, but it is now recommended that all users upgrade their scripts to use non-exclusive backups if possible. @@ -992,6 +1002,11 @@ SELECT pg_start_backup('label', true); for things to consider during this backup. + + Note that if the server crashes during the backup it may not be + possible to restart until the backup_label file has been + manually deleted from the PGDATA directory. + diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 8482601294..a803e747b2 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -18070,11 +18070,18 @@ postgres=# select pg_start_backup('label_goes_here'); pg_start_backup. In a non-exclusive backup, the contents of the backup_label and tablespace_map are returned in the result of the function, and should be written to files in the -backup (and not in the data directory). +backup (and not in the data directory). When executed on a primary +pg_stop_backup will wait for WAL to be archived when archiving +is enabled. 
On a standby pg_stop_backup will return +immediately without waiting so it's important to verify that all required +WAL segments have been archived. If write activity on the primary is low, it +may be useful to run pg_switch_wal on the primary in order to +trigger an immediate segment switch of the last required WAL. -The function also creates a backup history file in the transaction log +When executed on a primary, the function also creates a backup history file +in the write-ahead log archive area. The history file includes the label given to pg_start_backup, the starting and ending transaction log locations for the backup, and the starting and ending times of the backup. The return
Re: [HACKERS] More replication race conditions
On Mon, Aug 28, 2017 at 8:25 AM, Michael Paquier wrote: > Today's run has finished with the same failure: > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dangomushi=2017-08-27%2018%3A00%3A13 > Attached is a patch to make this code path wait until the transaction > has been replayed. We could use synchronous_commit = apply as well, > but I prefer the solution of this patch with a wait query. I have added an open item for this issue for PG10. The 2PC tests in src/test/recovery are new in PG10, introduced by 3082098. -- Michael
Re: [HACKERS] [PATCH] Fix drop replication slot blocking instead of returning error
On Tue, Aug 29, 2017 at 12:13 PM, Alvaro Herrera wrote: > Hi Alvaro, > Simone Gotti wrote: > > Hi all, > > > > I noticed that in postgres 10beta3, calling pg_drop_replication_slot on an > > active slot will block until it's released instead of returning an error > > like > > done in pg 9.6. Since this is a change in the previous behavior and the docs > > wasn't changed I made a patch to restore the previous behavior. > > Changing that behavior was the entire point of the cited commit. Sorry, I was thinking that the new behavior was needed for internal future functions since the doc wasn't changed. > > A better fix, from my perspective, is to amend the docs as per the > attached patch. This is what would be useful for logical replication, > which is what replication slots were invented for in the first place. I don't know the reasons why the new behavior is better for logical replication so I trust you. We are using replication slots for physical replication. > If you disagree, let's discuss what other use cases you have, and we can > come up with alternatives that satisfy both. I just faced the opposite problem: in stolon [1], we currently rely on the previous behavior, i.e. we don't want to block waiting for a slot to be released (which under some partitioning conditions could not happen for a long time), but prefer to keep retrying the drop later. Now we partially avoid blocking by timing out the drop operation after some seconds. Another idea would be to query the slot status before doing the drop, but that leads to a race condition (probably the opposite of the one that commit was fixing) if the slot is acquired between the query and the drop. > I think a decent answer, > but one which would create a bit of extra churn, would be to have an > optional boolean flag in the command/function for "nowait", instead of > hardcoding either behavior. I think that this will be the best fix.
I'm not sure about the policy of these commands and whether backward compatibility would be better (in that case the old behavior should be the default and a new "wait" flag could be added). If the default behavior is going to change we have to add different code for postgres >= 10. Thanks, Simone. [1] https://github.com/sorintlab/stolon > > > -- > Álvaro Herrera https://www.2ndQuadrant.com/ > PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
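The optional "nowait" flag under discussion is a small decision table: with nowait set, dropping an active slot fails immediately (the 9.6 behavior); without it, the caller blocks until the slot is released (the 10beta3 behavior). A hedged C sketch of that table (hypothetical names; the waiting itself is elided):

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { DROP_OK, DROP_ERR_ACTIVE, DROP_WOULD_WAIT } DropResult;

/* Sketch of a slot-drop entry point taking the proposed "nowait" flag:
 * nowait restores the 9.6 error-out behaviour, !nowait keeps the
 * 10beta3 blocking behaviour. Not the actual ReplicationSlotDrop code. */
static DropResult
drop_slot_sketch(bool slot_active, bool nowait)
{
    if (!slot_active)
        return DROP_OK;
    if (nowait)
        return DROP_ERR_ACTIVE;    /* pre-10 behaviour: fail at once */
    return DROP_WOULD_WAIT;        /* 10beta3 behaviour: block here */
}
```

Which value of nowait is the default is exactly the backward-compatibility question raised above.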
Re: [HACKERS] Tuple-routing for certain partitioned tables not working as expected
On 2017/08/25 22:26, Robert Haas wrote: On Wed, Aug 23, 2017 at 4:55 AM, Etsuro Fujita wrote: Agreed, but I'd vote for fixing this in v10 as proposed; I agree that just ripping the CheckValidResultRel checks out entirely is not a good idea, but that seems OK to me at least as a fix just for v10. I'm still not on-board with having this be the one case where we don't do CheckValidResultRel. If we want to still call it but pass down some additional information that can selectively skip certain checks, I could probably live with that. Another idea would be to not do CheckValidResultRel for partitions in ExecSetupPartitionTupleRouting; instead, do that the first time the partition is chosen by ExecFindPartition, and if successfully checked, initialize the partition's ResultRelInfo and other stuff. (We could skip this after the first time by checking whether we already have a valid ResultRelInfo for the chosen partition.) That could allow us to not only call CheckValidResultRel the way it is, but avoid initializing useless partitions for which tuples aren't routed at all. At some point we've got to stop developing v10 and just let it be what it is. I agree on that point, but ISTM that this is more like a bug. Best regards, Etsuro Fujita
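The lazy-initialization idea, running CheckValidResultRel and building a partition's ResultRelInfo only the first time ExecFindPartition routes a tuple there, is essentially a memoized lookup. A hedged C sketch with stand-in structures (not the executor's real ones):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-ins for the executor structures involved. */
typedef struct { int relid; bool valid; } ResultRelInfoSketch;

typedef struct
{
    int nparts;
    ResultRelInfoSketch *rri;      /* one slot per partition, all invalid */
} RoutingStateSketch;

/* Pretend CheckValidResultRel: here every non-negative relid is accepted. */
static bool check_valid_result_rel(int relid) { return relid >= 0; }

/* First time a tuple is routed to partition i, run the validity check and
 * set up its ResultRelInfo; later routings reuse the cached entry, and
 * partitions that never receive a tuple are never initialized at all. */
static ResultRelInfoSketch *
get_partition_rri(RoutingStateSketch *state, int i)
{
    ResultRelInfoSketch *r = &state->rri[i];
    if (!r->valid)
    {
        if (!check_valid_result_rel(i))
            return NULL;           /* would be ereport(ERROR) in the backend */
        r->relid = i;
        r->valid = true;
    }
    return r;
}
```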
Re: [HACKERS] Adding support for Default partition in partitioning
On Mon, Aug 21, 2017 at 4:47 PM, Jeevan Ladhe wrote: > > Hi, > > On Thu, Aug 17, 2017 at 3:29 PM, Jeevan Ladhe > wrote: >> >> Hi, >> >> On Tue, Aug 15, 2017 at 7:20 PM, Robert Haas >> wrote: >>> >>> On Wed, Jul 26, 2017 at 8:14 AM, Jeevan Ladhe >>> wrote: >>> > I have rebased the patches on the latest commit. >>> >>> This needs another rebase. >> >> >> I have rebased the patch and addressed your and Ashutosh's comments on the last >> set of patches. Thanks for the rebased patches. >> >> The current set of patches contains 6 patches as below: >> >> 0001: >> Refactoring existing ATExecAttachPartition code so that it can be used >> for >> default partitioning as well * Returns an expression tree describing the passed-in relation's partition - * constraint. + * constraint. If there are no partition constraints returns NULL e.g. in case + * default partition is the only partition. The first sentence uses singular "constraint". The second uses plural. Given that partition bounds together form a single constraint we should use singular "constraint" in the second sentence as well. Do we want to add a similar comment in the prologue of generate_partition_qual()? The current wording there seems to cover this case, but do we want to explicitly mention this case? +if (!and_args) +result = NULL; While this is correct, I am increasingly seeing (and_args != NIL) usage. get_partition_qual_relid() is called from pg_get_partition_constraintdef(), constr_expr = get_partition_qual_relid(relationId); /* Quick exit if not a partition */ if (constr_expr == NULL) PG_RETURN_NULL(); The comment is now wrong since a default partition may have no constraints. Maybe rewrite it as simply, "Quick exit if no partition constraint." generate_partition_qual() has three callers and all of them are capable of handling a NIL partition constraint for the default partition. Maybe it's better to mention in the commit message that we have checked that the callers of this function can handle a NIL partition constraint.
>> >> 0002: >> This patch teaches the partitioning code to handle the NIL returned by >> get_qual_for_list(). >> This is needed because a default partition will not have any constraints >> in case >> it is the only partition of its parent. If the partition being dropped is the default partition, heap_drop_with_catalog() locks the default partition twice, once as the default partition and the second time as the partition being dropped. So, it will be counted as locked twice. There doesn't seem to be any harm in this, since the lock will be held till the transaction ends, by when all the locks will be released. Same is the case with the cache invalidation message. If we are dropping the default partition, the cache invalidation message on "default partition" is not required. Again this might be harmless, but better to avoid it. A similar problem exists with ATExecDetachPartition(): the default partition will be locked twice if it's being detached. +/* + * If this is a default partition, pg_partitioned_table must have it's + * OID as value of 'partdefid' for it's parent (i.e. rel) entry. + */ +if (castNode(PartitionBoundSpec, boundspec)->is_default) +{ +Oidpartdefid; + +partdefid = get_default_partition_oid(RelationGetRelid(rel)); +Assert(partdefid == inhrelid); +} Since an accidental change or database corruption may change the default partition OID in pg_partitioned_table, an Assert won't help in such a case. Maybe we should throw an error or at least report a warning. If we throw an error, the table will become useless (or even the database will become useless, since RelationBuildPartitionDesc is called from RelationCacheInitializePhase3() on such a corrupted table). To avoid that we may raise a warning. I am wondering whether we could avoid the call to get_default_partition_oid() in the above block, thus avoiding a syscache lookup. The syscache search shouldn't be expensive since the cache should already have that entry, but still if we can avoid it, we save some CPU cycles.
The default partition OID is stored in the pg_partitioned_table catalog, which is looked up in RelationGetPartitionKey(), a function which precedes RelationGetPartitionDesc() everywhere. What if that RelationGetPartitionKey() also returns the default partition OID and the common caller passes it to RelationGetPartitionDesc()? +/* A partition cannot be attached if there exists a default partition */ +defaultPartOid = get_default_partition_oid(RelationGetRelid(rel)); +if (OidIsValid(defaultPartOid)) +ereport(ERROR, +(errcode(ERRCODE_INVALID_OBJECT_DEFINITION), + errmsg("cannot attach a new partition to table \"%s\" having a default partition", +
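Tuple routing in the presence of a default partition reduces to: take the first list partition whose bound accepts the key, else fall back to the default if one exists. A minimal C sketch of that rule, simplified to integer keys (hypothetical names, not the backend's routing code):

```c
#include <assert.h>

#define NO_PARTITION (-1)

/* One list partition: the set of key values its bound accepts. */
typedef struct { const int *vals; int nvals; } ListPartSketch;

/* Return the index of the partition accepting `key`, falling back to
 * default_idx (which may be NO_PARTITION) when no explicit bound
 * matches; this is also why the default partition itself can have a
 * NIL constraint when it is the only partition. */
static int
route_list_tuple(int key, const ListPartSketch *parts, int nparts,
                 int default_idx)
{
    for (int i = 0; i < nparts; i++)
        for (int j = 0; j < parts[i].nvals; j++)
            if (parts[i].vals[j] == key)
                return i;
    return default_idx;
}
```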
Re: [HACKERS] hash partitioning based on v10Beta2
Hi, When there are many partitions and a lot of data, adding a new partition causes some efficiency problems. How does the solution you're talking about implement hash partitioning? ---young HighGo Database: http://www.highgo.com On 8/28/2017 22:25, Robert Haas wrote: On Sat, Aug 26, 2017 at 12:40 AM, yang...@highgo.com wrote: > A partition table can be created as below: > > CREATE TABLE h1 PARTITION OF h; > CREATE TABLE h2 PARTITION OF h; > CREATE TABLE h3 PARTITION OF h; This syntax is very problematic for reasons that have been discussed on the existing hash partitioning thread. Fortunately, a solution has already been implemented... you're the third person to try to write a patch for this feature. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] [POC] hash partitioning
Hi, This is my patch; earlier I forgot to add the attachments. It is also discussed at the following address. https://www.postgresql.org/message-id/2017082612390093777512%40highgo.com ---young HighGo Database: http://www.highgo.com On 8/28/2017 16:28, Yugo Nagata wrote: Hi young, On Mon, 28 Aug 2017 15:33:46 +0800 "yang...@highgo.com" wrote: > Hello > > Looking at your hash partitioning syntax, I implemented a hash partition in a more concise way, with no need to determine the number of sub-tables, and the ability to add partitions dynamically. I think it is great work, but the current consensus about hash partitioning supports Amul's patch[1], in which the syntax is different from my original proposal. So, you will have to read Amul's patch and start a discussion if you still want to propose your implementation. Regards, [1] https://www.postgresql.org/message-id/CAAJ_b965A2oog=6efuhelexl3rmgfssb3g7lwkva1bw0wuj...@mail.gmail.com > > Description > > The hash partition implementation is based on the original range/list partitioning, and uses similar syntax. > > To create a partitioned table, use: > > CREATE TABLE h (id int) PARTITION BY HASH(id); > > The partitioning key supports only one value, and I think the partition key can support multiple values, > which may be difficult to implement when querying, but it is not impossible. > > A partition table can be created as below: > > CREATE TABLE h1 PARTITION OF h; > CREATE TABLE h2 PARTITION OF h; > CREATE TABLE h3 PARTITION OF h; > > The FOR VALUES clause cannot be used, and the partition bound is calculated automatically as the partition index of a single integer value.
> > An inserted record is stored in a partition whose index equals > DatumGetUInt32(OidFunctionCall1(lookup_type_cache(key->parttypid[0], TYPECACHE_HASH_PROC)->hash_proc, values[0])) % nparts/* Number of partitions */ > ; > In the above example, this is DatumGetUInt32(OidFunctionCall1(lookup_type_cache(key->parttypid[0], TYPECACHE_HASH_PROC)->hash_proc, id)) % 3; > > postgres=# insert into h select generate_series(1,20); > INSERT 0 20 > postgres=# select tableoid::regclass,* from h; > tableoid | id > --+ > h1 | 3 > h1 | 5 > h1 | 17 > h1 | 19 > h2 | 2 > h2 | 6 > h2 | 7 > h2 | 11 > h2 | 12 > h2 | 14 > h2 | 15 > h2 | 18 > h2 | 20 > h3 | 1 > h3 | 4 > h3 | 8 > h3 | 9 > h3 | 10 > h3 | 13 > h3 | 16 > (20 rows) > > Partitions can be added dynamically. When a new partition is created the number of partitions changes, so the calculated target partition for a given value changes, and the same data would land in different partitions. So when you create a new partition, the existing data must be re-hashed and inserted into its new target partition. > > postgres=# create table h4 partition of h; > CREATE TABLE > postgres=# select tableoid::regclass,* from h; > tableoid | id > --+ > h1 | 5 > h1 | 17 > h1 | 19 > h1 | 6 > h1 | 12 > h1 | 8 > h1 | 13 > h2 | 11 > h2 | 14 > h3 | 1 > h3 | 9 > h3 | 2 > h3 | 15 > h4 | 3 > h4 | 7 > h4 | 18 > h4 | 20 > h4 | 4 > h4 | 10 > h4 | 16 > (20 rows) > > When querying the data, the hash partition uses the same algorithm as the insertion, and filters out the tables that do not need to be scanned.
> > postgres=# explain analyze select * from h where id = 1; > QUERY PLAN > > Append (cost=0.00..41.88 rows=13 width=4) (actual time=0.020..0.023 rows=1 loops=1) >-> Seq Scan on h3 (cost=0.00..41.88 rows=13 width=4) (actual time=0.013..0.016 rows=1 loops=1) > Filter: (id = 1) > Rows Removed by Filter: 3 > Planning time: 0.346 ms > Execution time: 0.061 ms > (6 rows) > > postgres=# explain analyze select * from h where id in (1,5);; > QUERY PLAN > > Append (cost=0.00..83.75 rows=52 width=4) (actual time=0.016..0.028 rows=2 loops=1) >-> Seq Scan on h1 (cost=0.00..41.88 rows=26 width=4) (actual time=0.015..0.018 rows=1 loops=1) > Filter: (id = ANY ('{1,5}'::integer[])) > Rows Removed by Filter: 6 >-> Seq Scan on h3 (cost=0.00..41.88 rows=26 width=4) (actual time=0.005..0.007 rows=1 loops=1) >
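The routing rule quoted above, the key's hash modulo the number of partitions, can be sketched standalone; counting how many keys re-route when a partition is added shows why this proposal must re-distribute existing rows on every new partition (stand-in hash function, not PostgreSQL's hash_proc):

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the type's hash_proc; any 32-bit integer hash works for
 * the sketch (this is not PostgreSQL's hash function). */
static uint32_t
hash_int4_sketch(int32_t v)
{
    uint32_t h = (uint32_t) v;
    h ^= h >> 16;
    h *= UINT32_C(0x45d9f3b);
    h ^= h >> 16;
    return h;
}

/* The proposal's routing rule: partition index = hash(key) % nparts. */
static int
route_hash_tuple(int32_t key, int nparts)
{
    return (int) (hash_int4_sketch(key) % (uint32_t) nparts);
}

/* Adding a partition changes nparts, so most keys re-route; counting the
 * moved keys quantifies the re-distribution cost of the bare-modulo scheme. */
static int
count_moved(int32_t nkeys, int old_nparts, int new_nparts)
{
    int moved = 0;
    for (int32_t k = 0; k < nkeys; k++)
        if (route_hash_tuple(k, old_nparts) != route_hash_tuple(k, new_nparts))
            moved++;
    return moved;
}
```

This re-routing cost is one reason the consensus design on the other thread fixes the partition count up front instead of letting it change as partitions are attached.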
Re: [HACKERS] [PATCH] Fix drop replication slot blocking instead of returning error
Simone Gotti wrote: > Hi all, > > I noticed that in postgres 10beta3, calling pg_drop_replication_slot on an > active slot will block until it's released instead of returning an error > like > done in pg 9.6. Since this is a change in the previous behavior and the docs > wasn't changed I made a patch to restore the previous behavior. Changing that behavior was the entire point of the cited commit. A better fix, from my perspective, is to amend the docs as per the attached patch. This is what would be useful for logical replication, which is what replication slots were invented for in the first place. If you disagree, let's discuss what other use cases you have, and we can come up with alternatives that satisfy both. I think a decent answer, but one which would create a bit of extra churn, would be to have an optional boolean flag in the command/function for "nowait", instead of hardcoding either behavior. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services From 5761b79c6ebc4f6dfaf4d2489e9bff47b9e0c3ad Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Tue, 29 Aug 2017 12:08:47 +0200 Subject: [PATCH] Document that DROP_REPLICATION_SLOT now waits --- doc/src/sgml/protocol.sgml | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml index 7c012f59a3..e61d57a234 100644 --- a/doc/src/sgml/protocol.sgml +++ b/doc/src/sgml/protocol.sgml @@ -2179,7 +2179,8 @@ The commands accepted in walsender mode are: Drops a replication slot, freeing any reserved server-side resources. If - the slot is currently in use by an active connection, this command fails. + the slot is currently in use by an active connection, this command waits + until the slot becomes free. If the slot is a logical slot that was created in a database other than the database the walsender is connected to, this command fails.
-- 2.11.0 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: [PATCH] Fix drop replication slot blocking instead of returning error
Noah Misch wrote:

> On Thu, Aug 24, 2017 at 03:38:20PM +0200, Simone Gotti wrote:
> > I noticed that in postgres 10beta3, calling pg_drop_replication_slot on an
> > active slot will block until it's released instead of returning an error
> > like done in pg 9.6. Since this is a change from the previous behavior and
> > the docs weren't changed, I made a patch to restore the previous behavior.
> > After git commit 9915de6c1cb, calls to pg_drop_replication_slot or the
> > replication protocol DROP_REPLICATION_SLOT command will block when a
> > slot is in use instead of returning an error as before (as the doc
> > states).
> >
> > This patch will set nowait to true to restore the previous behavior.
>
> The above-described topic is currently a PostgreSQL 10 open item.

Looking at it now -- next report tomorrow.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] WIP: Separate log file for extension
On Fri, Aug 25, 2017 at 4:12 PM, Antonin Houska wrote:

> Attached is a draft patch to allow an extension to write log messages to a
> separate file.

Does it allow postgres core code to write log messages to multiple log files as well? I can imagine a use case where we want to write log messages to separate log files by error level or purpose (server log and SQL log, etc.).

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Re: [HACKERS] POC: Cache data in GetSnapshotData()
On Thu, Aug 3, 2017 at 2:16 AM, Andres Freund wrote:

> I think this patch should have a "cover letter" explaining what it's
> trying to achieve, how it does so and why it's safe/correct.

Cache the SnapshotData for reuse:
=================================
In one of our perf analyses using the perf tool, GetSnapshotData showed a very high percentage of CPU cycles on a read-only workload with a very high number of concurrent connections (>= 64).

Machine: cthulhu, an 8-node machine.
--
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                128
On-line CPU(s) list:   0-127
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             8
NUMA node(s):          8
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 47
Model name:            Intel(R) Xeon(R) CPU E7-8830 @ 2.13GHz
Stepping:              2
CPU MHz:               2128.000
BogoMIPS:              4266.63
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              24576K
NUMA node0 CPU(s):     0,65-71,96-103
NUMA node1 CPU(s):     72-79,104-111
NUMA node2 CPU(s):     80-87,112-119
NUMA node3 CPU(s):     88-95,120-127
NUMA node4 CPU(s):     1-8,33-40
NUMA node5 CPU(s):     9-16,41-48
NUMA node6 CPU(s):     17-24,49-56
NUMA node7 CPU(s):     25-32,57-64

On further analysis, this mainly comes down to cache misses happening while iterating through the large proc array to compute the SnapshotData. Also, on a read-only (read-heavy) workload the SnapshotData mostly remains the same, as no new transaction (or only rarely one) commits or rolls back. Taking advantage of this, we can save the previously computed SnapshotData in shared memory, and in GetSnapshotData we can use the saved SnapshotData instead of recomputing it while it is still valid. A saved SnapshotData remains valid as long as no open transaction ends after it was taken.

[Perf cache-misses analysis of base for 128 pgbench read-only clients]
======================================================================
Samples: 421K of event 'cache-misses', Event count (approx.): 160069274
Overhead  Command   Shared Object  Symbol
  18.63%  postgres  postgres       [.] GetSnapshotData
  11.98%  postgres  postgres       [.] _bt_compare
  10.20%  postgres  postgres       [.] PinBuffer
   8.63%  postgres  postgres       [.] LWLockAttemptLock
   6.50%  postgres  postgres       [.] LWLockRelease

[Perf cache-misses analysis with patch for 128 pgbench read-only clients]
=========================================================================
Samples: 357K of event 'cache-misses', Event count (approx.): 102380622
Overhead  Command   Shared Object  Symbol
   0.27%  postgres  postgres       [.] GetSnapshotData

With the patch, cache-misses events have reduced significantly.

Test setting:
=============
Server configuration:
./postgres -c shared_buffers=8GB -N 300 -c min_wal_size=15GB -c max_wal_size=20GB -c checkpoint_timeout=900 -c maintenance_work_mem=1GB -c checkpoint_completion_target=0.9 -c wal_buffers=256MB &

pgbench configuration:
scale_factor = 300
./pgbench -c $threads -j $threads -T $time_for_reading -M prepared -S postgres

The machine has 64 cores; with this patch I can see the server start to improve after 64 clients. I have tested up to 256 clients, which shows a performance improvement of nearly 39% at best [1]. This is the best case for the patch, where the once-computed SnapshotData is reused repeatedly.

The worst case seems to be small, quick write transactions which frequently invalidate the cached SnapshotData before it is reused by any next GetSnapshotData call. As of now, I tested a simple update workload (pgbench -M prepared -N) on the same machine with the above server configuration. I do not see much change in the TPS numbers.

All TPS are the median of 3 runs:

Clients   TPS-With Patch 05   TPS-Base        %Diff
1         752.461117          755.186777      -0.3%
64        32171.296537        31202.153576    +3.1%
128       41059.660769        40061.929658    +2.49%

I will do some profiling and find out why this case is not costing us some performance due to caching overhead.
>> diff --git a/src/backend/storage/ipc/procarray.c
>> b/src/backend/storage/ipc/procarray.c
>> index a7e8cf2..4d7debe 100644
>> --- a/src/backend/storage/ipc/procarray.c
>> +++ b/src/backend/storage/ipc/procarray.c
>> @@ -366,11 +366,13 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
>>                  (arrayP->numProcs - index - 1) * sizeof(int));
>>          arrayP->pgprocnos[arrayP->numProcs - 1] = -1;   /* for debugging */
>>          arrayP->numProcs--;
>> +        ProcGlobal->is_snapshot_valid = false;
>>          LWLockRelease(ProcArrayLock);
>>          return;
>>
Re: [HACKERS] postgres_fdw bug in 9.6
On 2017/08/21 20:37, Etsuro Fujita wrote:
> Attached is an updated version of the patch.

I corrected some comments. New version attached.

Best regards,
Etsuro Fujita

*** a/contrib/postgres_fdw/expected/postgres_fdw.out
--- b/contrib/postgres_fdw/expected/postgres_fdw.out
***************
*** 1701,1722 ****
  SELECT t1.c1, t2.c1 FROM ft1 t1 JOIN ft2 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c3, t
     Output: t1.c1, t2.c1, t1.c3, t1.*, t2.*
     Relations: (public.ft1 t1) INNER JOIN (public.ft2 t2)
     Remote SQL: SELECT r1."C 1", r1.c3, CASE WHEN (r1.*)::text IS NOT NULL THEN ROW(r1."C 1", r1.c2, r1.c3, r1.c4, r1.c5, r1.c6, r1.c7, r1.c8) END, r2."C 1", CASE WHEN (r2.*)::text IS NOT NULL THEN ROW(r2."C 1", r2.c2, r2.c3, r2.c4, r2.c5, r2.c6, r2.c7, r2.c8) END FROM ("S 1"."T 1" r1 INNER JOIN "S 1"."T 1" r2 ON (((r1."C 1" = r2."C 1" ORDER BY r1.c3 ASC NULLS LAST, r1."C 1" ASC NULLS LAST FOR UPDATE OF r1
!    ->  Merge Join
!          Output: t1.c1, t1.c3, t1.*, t2.c1, t2.*
!          Merge Cond: (t1.c1 = t2.c1)
!          ->  Sort
!                Output: t1.c1, t1.c3, t1.*
!                Sort Key: t1.c1
!                ->  Foreign Scan on public.ft1 t1
!                      Output: t1.c1, t1.c3, t1.*
!                      Remote SQL: SELECT "C 1", c2, c3, c4, c5, c6, c7, c8 FROM "S 1"."T 1" FOR UPDATE
!          ->  Sort
!                Output: t2.c1, t2.*
!                Sort Key: t2.c1
!                ->  Foreign Scan on public.ft2 t2
!                      Output: t2.c1, t2.*
!                      Remote SQL: SELECT "C 1", c2, c3, c4, c5, c6, c7, c8 FROM "S 1"."T 1"
! (23 rows)

  SELECT t1.c1, t2.c1 FROM ft1 t1 JOIN ft2 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c3, t1.c1 OFFSET 100 LIMIT 10 FOR UPDATE OF t1;
   c1 | c1
--- 1701,1716 ----
     Output: t1.c1, t2.c1, t1.c3, t1.*, t2.*
     Relations: (public.ft1 t1) INNER JOIN (public.ft2 t2)
     Remote SQL: SELECT r1."C 1", r1.c3, CASE WHEN (r1.*)::text IS NOT NULL THEN ROW(r1."C 1", r1.c2, r1.c3, r1.c4, r1.c5, r1.c6, r1.c7, r1.c8) END, r2."C 1", CASE WHEN (r2.*)::text IS NOT NULL THEN ROW(r2."C 1", r2.c2, r2.c3, r2.c4, r2.c5, r2.c6, r2.c7, r2.c8) END FROM ("S 1"."T 1" r1 INNER JOIN "S 1"."T 1" r2 ON (((r1."C 1" = r2."C 1" ORDER BY r1.c3 ASC NULLS LAST, r1."C 1" ASC NULLS LAST FOR UPDATE OF r1
!    ->  Nested Loop
!          Output: t1.c1, t1.c3, t1.*, t2.c1, t2.*
!          Join Filter: (t1.c1 = t2.c1)
!          ->  Foreign Scan on public.ft1 t1
!                Output: t1.c1, t1.c3, t1.*
!                Remote SQL: SELECT "C 1", c2, c3, c4, c5, c6, c7, c8 FROM "S 1"."T 1" FOR UPDATE
!          ->  Foreign Scan on public.ft2 t2
!                Output: t2.c1, t2.*
!                Remote SQL: SELECT "C 1", c2, c3, c4, c5, c6, c7, c8 FROM "S 1"."T 1"
! (17 rows)

  SELECT t1.c1, t2.c1 FROM ft1 t1 JOIN ft2 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c3, t1.c1 OFFSET 100 LIMIT 10 FOR UPDATE OF t1;
   c1 | c1
***************
*** 1745,1766 ****
  SELECT t1.c1, t2.c1 FROM ft1 t1 JOIN ft2 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c3, t
     Output: t1.c1, t2.c1, t1.c3, t1.*, t2.*
     Relations: (public.ft1 t1) INNER JOIN (public.ft2 t2)
     Remote SQL: SELECT r1."C 1", r1.c3, CASE WHEN (r1.*)::text IS NOT NULL THEN ROW(r1."C 1", r1.c2, r1.c3, r1.c4, r1.c5, r1.c6, r1.c7, r1.c8) END, r2."C 1", CASE WHEN (r2.*)::text IS NOT NULL THEN ROW(r2."C 1", r2.c2, r2.c3, r2.c4, r2.c5, r2.c6, r2.c7, r2.c8) END FROM ("S 1"."T 1" r1 INNER JOIN "S 1"."T 1" r2 ON (((r1."C 1" = r2."C 1" ORDER BY r1.c3 ASC NULLS LAST, r1."C 1" ASC NULLS LAST FOR UPDATE OF r1 FOR UPDATE OF r2
!    ->  Merge Join
!          Output: t1.c1, t1.c3, t1.*, t2.c1, t2.*
!          Merge Cond: (t1.c1 = t2.c1)
!          ->  Sort
!                Output: t1.c1, t1.c3, t1.*
!                Sort Key: t1.c1
!                ->  Foreign Scan on public.ft1 t1
!                      Output: t1.c1, t1.c3, t1.*
!                      Remote SQL: SELECT "C 1", c2, c3, c4, c5, c6, c7, c8 FROM "S 1"."T 1" FOR UPDATE
!          ->  Sort
!                Output: t2.c1, t2.*
!                Sort Key: t2.c1
!                ->  Foreign Scan on public.ft2 t2
!                      Output: t2.c1, t2.*
!                      Remote SQL: SELECT "C 1", c2, c3, c4, c5, c6, c7, c8 FROM "S 1"."T 1" FOR UPDATE
! (23 rows)

  SELECT t1.c1, t2.c1 FROM ft1 t1 JOIN ft2 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c3, t1.c1 OFFSET 100 LIMIT 10 FOR
Re: [HACKERS] pgbench: Skipping the creating primary keys after initialization
Hello,

Patch applies and works.

I would like to insist a little bit: the variable was declared as an int but passed to init and used as a bool there before the patch. Conceptually what is meant is really a bool, and I see no added value, bar saving a few keystrokes, in having an int. ISTM that recent pgbench changes have started turning old int-for-bool habits into using bool when bool is meant.

> Since is_no_vacuum is an existing one, if we follow the habit we should
> change other similar variables as well: is_init_mode, do_vacuum_accounts
> and debug. And I think we should change them in a separate patch.

Hmmm. The existing "is_no_vacuum" variable is typed *both* as int (in "main") and as bool (in "init"), called by main (yuk!). I see no reason to choose the bad one for the global :-)

> After more thought, I'm a bit inclined not to have a short option for
> --custom-initialize, because this option will be used only for
> non-primary cases. It would be better to save short options for future
> enhancements of pgbench. Thoughts?

I like it as is, especially as now the associated value is a simple and short string. I think that it makes sense to have a simple and short option to trigger it. Moreover -I stands cleanly for "initialization", and the capital stands for something a little special, which it is. It's too good to miss.

I think that "-I" should be added to the "--help" line, as is done with other short & long options.

Repeating "-I f" results in multiple foreign key constraints:

  Foreign-key constraints:
    "pgbench_tellers_bid_fkey" FOREIGN KEY (bid) REFERENCES pgbench_branches(bid)
    "pgbench_tellers_bid_fkey1" FOREIGN KEY (bid) REFERENCES pgbench_branches(bid)
    "pgbench_tellers_bid_fkey2" FOREIGN KEY (bid) REFERENCES pgbench_branches(bid)
    "pgbench_tellers_bid_fkey3" FOREIGN KEY (bid) REFERENCES pgbench_branches(bid)

I wonder if this could be avoided easily? Maybe by setting the constraint name explicitly so that the second one fails on the existing one, which is fine, like for primary keys? Or adding a DROP CONSTRAINT IF EXISTS before the CREATE CONSTRAINT, like for tables? Or doing nothing about it? I would prefer the first option.

Maybe the initial cleanup (DROP TABLE) could be made an option added to the default, so that cleaning up the database could be achieved with some "pgbench -i -I c", instead of connecting and dropping the tables one by one, which I have done quite a few times... What do you think?

Before it is definitely engraved, I'm thinking about the letters:

 c - cleanup
 t - create table
 d - data
 p - primary key
 f - foreign key
 v - vacuum

I think it is mostly okay, but this is the last chance to think about it. Using "d" for cleanup (drop) would mean finding another letter for filling in data... maybe "g" for data generation? "c" may have been chosen for "create table", but then would not be available for "cleanup". Thoughts?

--
Fabien.
Re: [HACKERS] [BUGS] [postgresql 10 beta3] unrecognized node type: 90
On Tue, Aug 29, 2017 at 8:32 AM, Robert Haas wrote:

> On Mon, Aug 28, 2017 at 10:17 PM, Tom Lane wrote:
>> In the meantime, I think what we should do is commit the bug fix more or
>> less as I have it, and then work on Amit's concern about losing parallel
>> efficiency by separating the resetting of shared parallel-scan state
>> into a new plan tree traversal that's done before launching new worker
>> processes.
>>
>> Sounds like a reasonable plan to me.
>>
>> The only real alternative is to lobotomize the existing rescan
>> optimizations, and that seems like a really poor choice from here.
>
> There's already ExecParallelReinitialize, which could be made to walk
> the nodes in addition to what it does already, but I don't understand
> exactly what else needs fixing.

Sure, but it is not advisable to reset the state of all the nodes below the Gather at that place; otherwise, it will be more or less like we are forcing a rescan of each node. I think there we can reset the shared parallel state of parallel-aware nodes (or anything related) and then allow rescan to reset the master-backend-specific state for all nodes beneath the Gather.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] assorted code cleanup
The following review has been posted through the commitfest application:

make installcheck-world:  tested, passed
Implements feature:       tested, passed
Spec compliant:           tested, passed
Documentation:            tested, passed

I've reviewed the code changes, and it's pretty clear to me that they clean things up a bit while not changing any behavior. They simplify things in a way that makes the code more comprehensible. I've run all the tests and they behave the same way as they did before the patch. I also tried manually playing around with the function in question, `metaphone`, and it seems to behave the same as before.

I think it's ready to commit!

The new status of this patch is: Ready for Committer
Re: [HACKERS] Hash Functions
On Tue, Aug 22, 2017 at 5:44 PM, amul sul wrote:

> On Fri, Aug 18, 2017 at 11:01 PM, Robert Haas wrote:
>> On Fri, Aug 18, 2017 at 1:12 PM, amul sul wrote:
>>> I have a small query: what if I want a cache entry with the extended
>>> hash function instead of the standard one, which I might require while
>>> adding the hash_array_extended function? Do you think we need to extend
>>> lookup_type_cache() as well?
>>
>> Hmm, I thought you had changed the hash partitioning stuff so that it
>> didn't rely on lookup_type_cache(). You have to look up the function
>> using the opclass provided in the partition key definition;
>> lookup_type_cache() will give you the default one for the datatype.
>> Maybe just use get_opfamily_proc?
>
> Yes, we can do that for the partitioning code, but my concern is a little
> bit different. I apologize, I wasn't clear enough.
>
> I am trying to extend the hash_array & hash_range functions. The
> hash_array and hash_range functions calculate the hash by using the
> respective hash function for the given argument type (i.e. the array/range
> element type), and those hash functions are made available in the
> TypeCacheEntry via lookup_type_cache(). But the extended versions of
> hash_array & hash_range require a respective extended hash function for
> those element types.
>
> I have added hash_array_extended & hash_range_extended functions in the
> attached patch 0001, which maintains a local copy of TypeCacheEntry with
> extended hash functions. But I am a little bit skeptical about this
> logic; any advice/suggestions will be greatly appreciated.

Instead, in the attached patch, I have modified lookup_type_cache() so it can be asked for the extended hash function in the TypeCacheEntry. For that, I've introduced new flags TYPECACHE_HASH_EXTENDED_PROC, TYPECACHE_HASH_EXTENDED_PROC_FINFO & TCFLAGS_CHECKED_HASH_EXTENDED_PROC, and additional variables in the TypeCacheEntry structure to hold the extended hash proc information.

The logic in the rest of the extended hash functions is the same as in the standard ones. The same goes for the hash_array_extended() & hash_range_extended() functions as well.

Regards,
Amul

0001-add-optional-second-hash-proc-v2.patch
Description: Binary data

0002-test-Hash_functions_wip.patch
Description: Binary data
Re: [HACKERS] Quorum commit for multiple synchronous replication.
On Thu, Aug 24, 2017 at 4:27 PM, Masahiko Sawada wrote:

> On Thu, Aug 24, 2017 at 3:11 AM, Josh Berkus wrote:
>> On 08/22/2017 11:04 PM, Masahiko Sawada wrote:
>>> WARNING: what you did is ok, but you might have wanted to do something else
>>>
>>> First of all, whether or not that can properly be called a warning is
>>> highly debatable. Also, if you do that sort of thing to your spouse
>>> and/or children, they call it "nagging". I don't think users will
>>> like it any more than family members do.
>>
>> Realistically, we'll support the backwards-compatible syntax for 3-5
>> years. Which is fine.
>>
>> I suggest that we just gradually deprecate the old syntax from the docs,
>> and then around Postgres 16 eliminate it. I posit that that's better
>> than changing the meaning of the old syntax out from under people.
>
> It seems to me that no one intently votes for making the quorum commit
> the default. Some folks suggest keeping backward compatibility in PG10
> and gradually deprecating the old syntax, and only a notice in the docs
> is possible for PG10.

According to the discussion so far, it seems to me that keeping backward compatibility, and issuing a warning in the docs that the old syntax could be changed or removed in a future release, is the most acceptable way for PG10. There has been no objection against that so far, and I already posted a patch to add a warning to the docs [1]. I'll wait for the committer's decision.

[1] https://www.postgresql.org/message-id/CAD21AoAe%2BoGSFi3bjZ%2BfW6Q%3DTK7avPdDCZLEr02zM_c-U0JsRA%40mail.gmail.com

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Re: [HACKERS] POC: Cache data in GetSnapshotData()
Hi all, please ignore this mail; it is incomplete and I have to add more information. Sorry, I accidentally pressed the send button while replying.

On Tue, Aug 29, 2017 at 11:27 AM, Mithun Cy wrote:

> Cache the SnapshotData for reuse:
> =================================
> In one of our perf analyses using the perf tool, GetSnapshotData showed
> a very high percentage of CPU cycles on a read-only workload with a very
> high number of concurrent connections (>= 64).
>
> Machine: cthulhu, an 8-node machine.
> --
> Architecture:          x86_64
> CPU op-mode(s):        32-bit, 64-bit
> Byte Order:            Little Endian
> CPU(s):                128
> On-line CPU(s) list:   0-127
> Thread(s) per core:    2
> Core(s) per socket:    8
> Socket(s):             8
> NUMA node(s):          8
> Vendor ID:             GenuineIntel
> CPU family:            6
> Model:                 47
> Model name:            Intel(R) Xeon(R) CPU E7-8830 @ 2.13GHz
> Stepping:              2
> CPU MHz:               2128.000
> BogoMIPS:              4266.63
> Virtualization:        VT-x
> L1d cache:             32K
> L1i cache:             32K
> L2 cache:              256K
> L3 cache:              24576K
> NUMA node0 CPU(s):     0,65-71,96-103
> NUMA node1 CPU(s):     72-79,104-111
> NUMA node2 CPU(s):     80-87,112-119
> NUMA node3 CPU(s):     88-95,120-127
> NUMA node4 CPU(s):     1-8,33-40
> NUMA node5 CPU(s):     9-16,41-48
> NUMA node6 CPU(s):     17-24,49-56
> NUMA node7 CPU(s):     25-32,57-64
>
> Perf CPU cycle, 128 clients.
>
> On further analysis, this mainly comes down to cache misses happening
> while iterating through the large proc array to compute the
> SnapshotData. Also, on a read-only (read-heavy) workload the
> SnapshotData mostly remains the same, as no new transaction (or only
> rarely one) commits or rolls back. Taking advantage of this, we can
> save the previously computed SnapshotData in shared memory, and in
> GetSnapshotData we can use the saved SnapshotData instead of
> recomputing it while it is still valid. A saved SnapshotData is valid
> until a transaction ends.
>
> [Perf analysis, base]
>
> Samples: 421K of event 'cache-misses', Event count (approx.): 160069274
> Overhead  Command   Shared Object  Symbol
>   18.63%  postgres  postgres       [.] GetSnapshotData
>   11.98%  postgres  postgres       [.] _bt_compare
>   10.20%  postgres  postgres       [.] PinBuffer
>    8.63%  postgres  postgres       [.] LWLockAttemptLock
>    6.50%  postgres  postgres       [.] LWLockRelease
>
> [Perf analysis with patch]
>
> Server configuration:
> ./postgres -c shared_buffers=8GB -N 300 -c min_wal_size=15GB -c
> max_wal_size=20GB -c checkpoint_timeout=900 -c
> maintenance_work_mem=1GB -c checkpoint_completion_target=0.9 -c
> wal_buffers=256MB &
>
> pgbench configuration:
> scale_factor = 300
> ./pgbench -c $threads -j $threads -T $time_for_reading -M prepared -S
> postgres
>
> The machine has 64 cores; with this patch I can see the server start to
> improve after 64 clients. I have tested up to 256 clients, which shows
> a performance improvement of nearly 39% at best [1]. This is the best
> case for the patch, where the once-computed SnapshotData is reused
> further.
>
> The worst case seems to be small, quick write transactions which
> frequently invalidate the cached SnapshotData before it is reused by
> any next GetSnapshotData call. As of now, I tested a simple update
> workload (pgbench -M prepared -N) on the same machine with the above
> server configuration. I do not see much change in the TPS numbers.
>
> All TPS are the median of 3 runs:
> Clients   TPS-With Patch 05   TPS-Base        %Diff
> 1         752.461117          755.186777      -0.3%
> 64        32171.296537        31202.153576    +3.1%
> 128       41059.660769        40061.929658    +2.49%
>
> I will do some profiling and find out why this case is not costing us
> some performance due to caching overhead.
>
> []
>
> On Thu, Aug 3, 2017 at 2:16 AM, Andres Freund wrote:
>> Hi,
>>
>> I think this patch should have a "cover letter" explaining what it's
>> trying to achieve, how it does so and why it's safe/correct. I think
>> it'd also be good to try to show some of the worst cases of this patch
>> (i.e. where the caching only adds overhead, but never gets used), and
>> some of the best cases (where the cache gets used quite often, and
>> reduces contention significantly).
>>
>> I wonder if there's a way to avoid copying the snapshot into the cache
>> in situations where we're barely ever going to need it. But right now
>> the only way I can think of is to look at the length of the
>> ProcArrayLock wait queue and count the exclusive locks therein - which'd
>> be expensive and intrusive...
>>
>> On 2017-08-02 15:56:21 +0530, Mithun Cy wrote:
>>> diff --git a/src/backend/storage/ipc/procarray.c
>>> b/src/backend/storage/ipc/procarray.c
>>>
Re: [HACKERS] MAIN, Uncompressed?
On 26 August 2017 at 05:40, Mark Kirkwood wrote:

> On 26/08/17 12:18, Simon Riggs wrote:
>> On 25 August 2017 at 20:53, Tom Lane wrote:
>>> Greg Stark writes:
>>>> I think this is a particularly old piece of code and we're lucky the
>>>> default heuristics have served well for all this time, because I
>>>> doubt many people fiddle with these storage attributes. The time may
>>>> have come to come up with a better UI for the storage attributes,
>>>> because people are doing new things (like json) and wanting more
>>>> control over this heuristic.
>>>
>>> Yeah, I could get behind a basic rethinking of the tuptoaster control
>>> knobs. I'm just not in love with Simon's specific proposal, especially
>>> not if we can't think of a better name for it than "MAINU".
>>
>> Extended/External would be just fine if you could set the toast
>> target, so I think a better suggestion would be to make "toast_target"
>> a per-attribute option.
>
> +1, have thought about this myself previously; thank you for bringing
> it up!

OK, so a table-level option for "toast_tuple_target", not an attribute-level option.

The attached patch and test show this concept is useful and doesn't affect existing data. For 4x 4000-byte rows:

* by default we use 1 heap block and 3 toast blocks
* toast_tuple_target=4080 uses 2 heap blocks and 0 toast blocks

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

toast_tuple_target.v1.patch
Description: Binary data
Re: [HACKERS] standby server crashes hard on out-of-disk-space in HEAD
On Mon, Aug 28, 2017 at 9:06 PM, Michael Paquier wrote:

> On Tue, Jun 13, 2017 at 4:21 AM, Andres Freund wrote:
>> On 2017-06-12 15:12:23 -0400, Robert Haas wrote:
>>> Commit 4b4b680c3d6d8485155d4d4bf0a92d3a874b7a65 (Make backend local
>>> tracking of buffer pins memory efficient., vintage 2014) seems like a
>>> likely culprit here, but I haven't tested.
>>
>> I'm not that sure. As written above, the Assert isn't new, and given
>> this hasn't been reported before, I'm a bit doubtful that it's a general
>> refcount tracking bug. The FPI code has been whacked around more
>> heavily, so it could well be a bug in it somewhere.
>
> Someone doing a bisect could just use a VM that puts the standby on
> a tiny partition. I remember seeing this assertion failure some time
> ago on a test deployment, and that was really surprising. I think that
> this may be hiding something, so we should really try to investigate
> more what's wrong here.

I have been playing a bit with the builds and the attached script triggering out-of-space errors on a standby (adapt it to your environment), and while looking for a good commit, I have found that this thing is a bit older than the 2014 vintage... Down to the merge-base of REL9_4_STABLE and REL9_3_STABLE, the assertion failure is still the same. The failure is older than even 9.2; for example, testing at the merge-base of 9.2 and 9.3:

CONTEXT: xlog redo insert(init): rel 1663/16384/16385; tid 181441/1
TRAP: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1788)

But well, this assertion got changed in dcafdbcd.

--
Michael

crash_standby.bash
Description: Binary data
Re: [HACKERS] show "aggressive" or not in autovacuum logs
On Tue, Aug 29, 2017 at 10:16 AM, Robert Haas wrote:

> On Mon, Aug 28, 2017 at 5:26 AM, Kyotaro HORIGUCHI wrote:
>> Currently the message shows the '%d skipped-frozen' message, but
>> it is insufficient to verify the true effect. This is a patch to
>> show the mode as 'aggressive' or 'normal' in the closing message of
>> vacuum. '%d skipped-frozen' in 'aggressive mode' shows the true
>> effect of ALL_FROZEN.
>>
>> I will add this patch to CF2017-09.
>
> I would be a bit inclined to somehow show "aggressive" if it's
> aggressive and not insert anything at all otherwise. That'd probably
> require two separate translatable strings in each case, but maybe
> that's OK.
>
> What do other people think?

FWIW I prefer Robert's idea: insert nothing for a normal vacuum.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center