Re: [HACKERS] Why does plpython delay composite type resolution?
On 12/21/2016 04:14 AM, Jim Nasby wrote:
> Why do functions that accept composite types delay type resolution until
> execution? I have a naive patch that speeds up plpy.execute() by 8% by
> caching interned python strings for the dictionary key names (which are
> repeated over and over). The next step is to just pre-allocate those
> strings as appropriate for the calling context, but it's not clear how
> to handle that for input arguments.

Does your patch handle "ALTER TYPE name ADD ATTRIBUTE ..."? My immediate guess would be that it could be a cache invalidation thing.

Andreas

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
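(As an illustration of the interning idea in Jim's patch, here is a rough Python-level sketch. The actual patch is C code inside plpython, and the names below are made up for the example:)

```python
import sys

# Hypothetical cache of interned key strings, mirroring the idea of
# interning each column name once so that building a dict per row
# reuses the same key objects instead of allocating new strings.
_key_cache = {}

def cached_key(name):
    key = _key_cache.get(name)
    if key is None:
        key = sys.intern(name)
        _key_cache[name] = key
    return key

def row_to_dict(colnames, values):
    # Done over and over, once per row of a result set; every dict
    # shares the same interned key objects.
    return {cached_key(n): v for n, v in zip(colnames, values)}
```

With interned keys, dict lookups and construction can take the pointer-equality fast path, which is where the speedup comes from.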
Re: [HACKERS] Replication slot xmin is not reset if HS feedback is turned off while standby is shut down
On Wed, Dec 21, 2016 at 4:14 AM, Craig Ringer wrote:
> Re the patch, I don't like
>
> -   static bool master_has_standby_xmin = false;
> +   static bool master_has_standby_xmin = true;
>
> without any comment. It's addressed in the comment changes on the
> header func, but the link isn't obvious. Maybe a oneliner to say
> "ensure we always send at least one feedback message"?

Will fix.

> I think this part of the patch is correct and useful.
>
> I don't see the point of
>
> +    *
> +    * If Hot Standby is not yet active we reset the xmin value. Check this
> +    * after the interval has expired to reduce number of calls.
>      */
> -   if (hot_standby_feedback)
> +   if (hot_standby_feedback && HotStandbyActive())
>
> though. Forcing feedback to send InvalidTransactionId until hot
> standby feedback is actively running seems counterproductive; we want
> to lock in feedback as soon as possible, not wait until we're
> accepting transactions. Simon and I are in fact working on changes to
> do the opposite of what you've got here and ensure that we don't allow
> hot standby connections until we know hot_standby_feedback is in
> effect and a usable xmin is locked in. That way the master won't
> remove tuples we need as soon as we start an xact and cause a conflict
> with recovery.
>
> If there are no running xacts, GetOldestXmin() will return the slot
> xmin if any. We should definitely not be clearing that just because
> we're not accepting hot standby connections yet; we want to make very
> sure it remains in effect unless the user explicitly de-configures hot
> standby.
>
> (There's another pre-existing problem there, where we can start the
> walsender before slots are initialized, that I'll be addressing
> separately).
>
> If there's no slot xmin either, we'll send nextXid. That's a sensible
> starting point for hot standby feedback until we start some
> transactions.
>
> So -1 on this part of the patch, unless there's something I've misunderstood.
Currently no feedback is sent if hot standby is not active. I was not sure whether it was safe to call GetOldestXmin() in that case. However, I did not consider cascading replica slots wanting to hold back xmin, where resetting the parent's xmin is indeed wrong. Do you know if GetOldestXmin() is safe at this point, so we can just remove the HotStandbyActive() check? Otherwise I think the correct approach is to move the check and return inside the hot_standby_feedback case, like this:

    if (hot_standby_feedback)
    {
        if (!HotStandbyActive())
            return;

Regards,
Ants Aasma
Re: [HACKERS] too low cost of Bitmap index scan
2016-12-21 0:01 GMT+01:00 Tom Lane:
> Pavel Stehule writes:
>> I am trying to fix slow query on PostgreSQL 9.5.4.
>> The data are almost in RAM
>
> If it's all in RAM, you'd likely be well-served to lower random_page_cost.
> It looks to me like the planner is estimating pretty accurately how many
> heap fetches will be eliminated by using the extra index; where it's off
> seems to be in the cost of those heap fetches relative to the index work.

When I decreased random_page_cost, the cost of the bitmap scan decreased too:

https://explain.depesz.com/s/7CAJ .. random page cost 2
https://explain.depesz.com/s/iEBW .. random page cost 2, bitmapscan off
https://explain.depesz.com/s/W4zw .. random page cost 2
https://explain.depesz.com/s/Gar .. random page cost 1, bitmapscan off

I played with other costs, but without any success; the estimated cost of the bitmap scan remains significantly cheaper than that of the index scan.

Regards
Pavel

> regards, tom lane
Re: [HACKERS] Logical decoding on standby
On 21/12/16 04:06, Craig Ringer wrote: > On 20 December 2016 at 15:03, Petr Jelinek > wrote: > >>> The biggest change in this patch, and the main intrusive part, is that >>> procArray->replication_slot_catalog_xmin is no longer directly used by >>> vacuum. Instead, a new ShmemVariableCache->oldestCatalogXmin field is >>> added, with a corresponding CheckPoint field. >>> [snip] >> >> If this mechanism would not be needed most of the time, wouldn't it be >> better to not have it and just have a way to ask physical slot about >> what's the current reserved catalog_xmin (in most cases the standby >> should actually know what it is since it's sending the hs_feedback, but >> first slot created on such standby may need to check). > > Yes, and that was actually my originally preferred approach, though > the one above does offer the advantage that if something goes wrong we > can detect it and fail gracefully. Possibly not worth the complexity > though. > > Your approach requires us to make very sure that hot_standby_feedback > does not get turned off by user or become ineffective once we're > replicating, though, since we won't have any way to detect when needed > tuples are removed. We'd probably just bail out with relcache/syscache > lookup errors, but I can't guarantee we wouldn't crash if we tried > logical decoding on WAL where needed catalogs have been removed. > > I initially ran into trouble doing that, but now think I have a way forward. > >> WRT preventing >> hs_feedback going off, we can refuse to start with hs_feedback off when >> there are logical slots detected. > > Yes. There are some ordering issues there though. We load slots quite > late in startup and they don't get tracked in checkpoints. So we might > launch the walreceiver before we load slots and notice their needed > xmin/catalog_xmin. So we need to prevent sending of > hot_standby_feedback until slots are loaded, or load slots earlier in > startup. 
> The former sounds less intrusive and safer - probably just
> add an "initialized" flag to ReplicationSlotCtlData and suppress
> hs_feedback until it becomes true.
>
> We'd also have to suppress the validation callback action on the
> hot_standby_feedback GUC until we know replication slot state is
> initialised, and perform the check during slot startup instead. The
> hot_standby_feedback GUC validation check might get called before
> shmem is even set up, so we have to guard against attempts to access a
> shmem segment that may not even exist yet.
>
> The general idea is workable though. Refuse to start if logical slots
> exist and hot_standby_feedback is off or walreceiver isn't using a
> physical slot. Refuse to allow hot_standby_feedback to change
>

These are all problems associated with replication slots being in shmem, if I understand correctly. I wonder, could we put just a bool someplace which is available early that says whether there are any logical slots defined? We don't actually need all the slot info, just to know if there are some.

>> You may ask what if user drops the slot and recreates or somehow
>> otherwise messes up catalog_xmin on master; well, considering that under
>> my proposal we'd first (on connect) check the slot for catalog_xmin we'd
>> know about it, so we'd either mark the local slots broken/drop them or
>> plainly refuse to connect to the master, same way as if it didn't have
>> required WAL anymore (not sure which behavior is better). Note that user
>> could mess up catalog_xmin even in your design the same way, so it's not
>> really a regression.
>
> Agreed. Checking catalog_xmin of the slot when we connect is
> sufficient to guard against that, assuming we can trust that the
> catalog_xmin is actually in effect on the master. Consider cascading
> setups, where we set our catalog_xmin but it might not be "locked in"
> until the middle cascaded server relays it to the master.
> > I have a proposed solution to that which I'll outline in a separate > patch+post; it ties in to some work on addressing the race between hot > standby feedback taking effect and queries starting on the hot > standby. It boils down to "add a hot_standby_feedback reply protocol > message". > >> Plus >> it might even be okay to only allow creating logical slots on standbys >> connected directly to master in v1. > > True. I didn't consider that. > > We haven't had much luck in the past with such limitations, but > personally I'd consider it a perfectly reasonable one. > I think it's infinitely better with that limitation than the status quo. Especially for failover scenario (you usually won't failover to replica down the cascade as it's always more behind). Not to mention that with every level of cascading you get automatically more lag which means more bloat so it might not even be all that desirable to go that route immediately in v1 when we don't have way to control that bloat/maximum slot lag. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Dev
Re: [HACKERS] Logical decoding on standby
On 21/12/16 04:31, Craig Ringer wrote: > On 20 December 2016 at 15:03, Petr Jelinek > wrote: > >> But in 0003 I don't understand following code: >>> + if (endpos != InvalidXLogRecPtr && !do_start_slot) >>> + { >>> + fprintf(stderr, >>> + _("%s: cannot use --create-slot or >>> --drop-slot together with --endpos\n"), >>> + progname); >>> + fprintf(stderr, _("Try \"%s --help\" for more >>> information.\n"), >>> + progname); >>> + exit(1); >>> + } >> >> Why is --create-slot and --endpos not allowed together? > > Actually, the test is fine, the error is just misleading due to my > misunderstanding. > > The fix is simply to change the error message to > > _("%s: --endpos may only be specified > with --start\n"), > > so I won't post a separate followup patch. > Ah okay makes sense. The --create-slot + --endpos should definitely be allowed combination, especially now that we can extend this to optionally use temporary slot. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_background contrib module proposal
2016-12-21 4:55 GMT+05:00 David Fetter:
> I see.
>
> I find the following a little easier to follow, and the sleeps still
> work even when very short.

Now I know a little more SQL :) Thank you. I'm not sure every platform supports microsecond sleeps, but it will work anyway. This test is deterministic: without a sequence there is no race condition. So it is not sleepsort, it is deterministic. Though I agree that it is a good thing for a test, I'd still add some milliseconds to the test case, so that the main query definitely has to wait for the other sleeping queries to finish.

> CREATE TABLE input AS
> SELECT x, row_number() OVER (ORDER BY x) n
> FROM generate_series(0,.05,0.01) x
> ORDER BY x DESC;
>
> CREATE TABLE output(place int, value float);
>
> CREATE TABLE handles AS
> SELECT pg_background_launch(
>     format($$
>         SELECT pg_sleep(%s);
>         INSERT INTO output VALUES (%s, %s)
>     $$,
>     x, n, x
>     )
> ) h
> FROM input;
>
> --wait until everyone finishes
> SELECT * FROM handles JOIN LATERAL pg_background_result(handles.h) AS (x TEXT) ON (true);
>
> --output results
> SELECT * FROM output ORDER BY place;

Best regards,
Andrey Borodin.
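(As an aside, why the ordering is deterministic can be illustrated outside SQL. In this plain-Python analogy, a thread pool stands in for the background workers: results are collected in launch order, like waiting on pg_background_result() for each handle, so the completion order of the sleeps cannot affect the output. Everything here is illustrative, not pg_background itself:)

```python
import time
from concurrent.futures import ThreadPoolExecutor

def background(delay, place, value):
    # stands in for the launched query: sleep, then report a row
    time.sleep(delay)
    return (place, value)

inputs = [(0.05, 1, 0.00), (0.03, 2, 0.01), (0.01, 3, 0.02)]
with ThreadPoolExecutor(max_workers=3) as pool:
    handles = [pool.submit(background, *args) for args in inputs]
    # wait on each handle in launch order; the shortest sleep finishing
    # first does not change the order in which results are gathered
    output = [h.result() for h in handles]
```

The sleeps only control how long the "main query" has to wait, not which rows it sees or in what order.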
Re: [HACKERS] bigint vs txid user confusion
On 12/20/16 10:20 PM, Craig Ringer wrote:
> Tools look at pg_class.relfrozenxid and pg_database.datfrozenxid more
> than probably anything else, so making changes that ignore them is
> pretty pointless.

Except the only useful way I know of to access *frozenxid is using age(), and even that is a royal PITA when the xid is a special xid. So I'd argue that we should effectively remove xid from the user's view. Even if we don't want to bloat pg_class by 4 bytes, we should just make xid even more opaque than it is today and tell users to just cast it to bigxid.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)
Re: [HACKERS] simplehash vs. pgindent
Robert Haas writes: > In commit b30d3ea824c5ccba43d3e942704f20686e7dbab8, when Andres added > the simplehash stuff, he also added SH_TYPE, SH_ITERATOR, and > SH_STATUS to typedefs.list. When I subsequently updated typedefs.list > from the buildfarm in acddbe221b084956a0efd6e4b6c6586e8fd994d7, it of > course removed those entries since they are merely placeholders for > real types, not anything that would ever appear in a symbol table. So > now if you pgindent the simplehash stuff, it does stupid things. > I think we should add a file src/tools/pgindent/typedefs.extra.list. > Somehow teach pgindent to search whatever directory it searches for > typedefs.list for typedefs.extra.list as well. Then we can put stuff > in there that the buildfarm doesn't find, and pgindent can combine the > files in load_typedefs(). That way we can still update typedefs.list > straight from the buildfarm, but there's a place to put manual entries > that we don't want to lose. We could also teach load_typedefs() to > strip out lines that are empty or start with #, and then we'd be able > to put comments in typedefs.extra.list explaining why each entry needs > to be present there rather than relying on the normal mechanism. Or we could just fix the code so it doesn't use phony typedefs? If a typedef is never used as a variable or field type, how much do we really need it? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Declarative partitioning - another take
On 2016/12/21 14:03, Amit Langote wrote:
> OK, updated patch attached.

Oops, incomplete patch that was. Complete patch attached this time.

Thanks,
Amit

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 1c219b03dd..115b98313e 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -13297,8 +13297,11 @@ ATExecAttachPartition(List **wqueue, Relation rel, PartitionCmd *cmd)
 		}
 	}
 
+	/* It's safe to skip the validation scan after all */
 	if (skip_validate)
-		elog(NOTICE, "skipping scan to validate partition constraint");
+		ereport(INFO,
+				(errmsg("partition constraint for table \"%s\" is implied by existing constraints",
+						RelationGetRelationName(attachRel))));
 
 	/*
 	 * Set up to have the table to be scanned to validate the partition
diff --git a/src/test/regress/expected/alter_table.out b/src/test/regress/expected/alter_table.out
index 99e20eb922..62e18961d3 100644
--- a/src/test/regress/expected/alter_table.out
+++ b/src/test/regress/expected/alter_table.out
@@ -3179,7 +3179,7 @@ ALTER TABLE list_parted2 ATTACH PARTITION part_3_4 FOR VALUES IN (3, 4);
 ALTER TABLE list_parted2 DETACH PARTITION part_3_4;
 ALTER TABLE part_3_4 ALTER a SET NOT NULL;
 ALTER TABLE list_parted2 ATTACH PARTITION part_3_4 FOR VALUES IN (3, 4);
-NOTICE: skipping scan to validate partition constraint
+INFO: partition constraint for table "part_3_4" is implied by existing constraints
 -- check validation when attaching range partitions
 CREATE TABLE range_parted (
 	a int,
@@ -3204,7 +3204,7 @@ CREATE TABLE part2 (
 	b int NOT NULL CHECK (b >= 10 AND b < 18)
 );
 ALTER TABLE range_parted ATTACH PARTITION part2 FOR VALUES FROM (1, 10) TO (1, 20);
-NOTICE: skipping scan to validate partition constraint
+INFO: partition constraint for table "part2" is implied by existing constraints
 -- check that leaf partitions are scanned when attaching a partitioned
 -- table
 CREATE TABLE part_5 (
@@ -3219,7 +3219,7 @@ ERROR: partition constraint is violated by some row
 DELETE FROM part_5_a WHERE a NOT IN (3);
 ALTER TABLE part_5 ADD CONSTRAINT check_a CHECK (a IN (5)), ALTER a SET NOT NULL;
 ALTER TABLE list_parted2 ATTACH PARTITION part_5 FOR VALUES IN (5);
-NOTICE: skipping scan to validate partition constraint
+INFO: partition constraint for table "part_5" is implied by existing constraints
 -- check that the table being attached is not already a partition
 ALTER TABLE list_parted2 ATTACH PARTITION part_2 FOR VALUES IN (2);
 ERROR: "part_2" is already a partition
Re: [HACKERS] BUG: pg_stat_statements query normalization issues with combined queries
At Tue, 20 Dec 2016 22:42:48 -0500, Robert Haas wrote in
> On Tue, Dec 20, 2016 at 6:18 AM, Fabien COELHO wrote:
>> Would this approach be acceptable, or is modifying Nodes a no-go area?
>>
>> If it is acceptable, I can probably put together a patch and submit it.
>>
>> If not, I suggest to update the documentation to tell that
>> pg_stat_statements does not work properly with combined queries.
>
> I think you've found a bug, but I'm a little doubtful about your
> proposed fix. However, I haven't studied the code, so I don't know
> what other approach might be better.

That will work and is doable, but the more fundamental issue here seems to be that post_parse_analyze_hook and other similar hooks are called with a Query inconsistent with debug_query_string. The variable is very conveniently used, but its name seems to suggest that such usage is outside its intended purpose. I'm not sure, though.

A similar behavior shows up in error messages, but it harms far less there than in the pg_stat_statements case.

ERROR:  column "b" does not exist
LINE 1: ...elect * from t where a = 1; select * from t where b = 2; com...
                                                             ^

1. Let all of the parse nodes carry their location in debug_query_string. (The original proposal)

2. Let the parser return once for each statement (like the psql parser) and store the corresponding query string in a variable other than debug_query_string.

...

regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: [HACKERS] Parallel bitmap heap scan
Rebased on the current head. On Tue, Dec 13, 2016 at 12:06 PM, Dilip Kumar wrote: > On Sat, Dec 10, 2016 at 5:36 PM, Amit Kapila wrote: >> Few assorted comments: > > Thanks for the review > >> >> 1. >> + else if (needWait) >> + { >> + /* Add ourself to wait queue */ >> + ConditionVariablePrepareToSleep(&pbminfo->cv); >> + queuedSelf = true; >> + } >> >> With the committed version of condition variables, you can avoid >> calling ConditionVariablePrepareToSleep(). Refer latest parallel >> index scan patch [1]. > > Oh, I see, Fixed. >> >> 2. >> + ParallelBitmapScan >> + Waiting for leader to complete the TidBitmap. >> + >> + >> >> Isn't it better to write it as below: >> >> Waiting for the leader backend to form the TidBitmap. > Done this way.. >> >> 3. >> + * Update snpashot info in heap scan descriptor. >> >> /snpashot/snapshot > Fixed >> >> 4. >> #include "utils/tqual.h" >> - >> +#include "miscadmin.h" >> >> static void bitgetpage(HeapScanDesc scan, TBMIterateResult *tbmres); >> - >> +static bool pbms_is_leader(ParallelBitmapInfo *pbminfo); >> >> TBMIterateResult *tbmres; >> - >> + ParallelBitmapInfo *pbminfo = node->parallel_bitmap; >> >> static int tbm_comparator(const void *left, const void *right); >> - >> +void * tbm_alloc_shared(Size size, void *arg); >> >> It seems line deletes at above places are spurious. Please check for >> similar occurrences at other places in patch. >> > Fixed >> 5. >> + bool is_shared; /* need to build shared tbm if set*/ >> >> space is required towards the end of the comment (set */). > Fixed >> >> 6. >> + /* >> + * If we have shared TBM means we are running in parallel mode. >> + * So directly convert dsa array to page and chunk array. >> + */ >> >> I think the above comment can be simplified as "For shared TBM, >> directly convert dsa array to page and chunk array" > > Done this way.. >> >> 7. >> + dsa_entry = (void*)(((char*)dsa_entry) + sizeof(dsa_pointer)); >> >> extra space is missing at multiple places in above line. 
It should be >> written as below: >> >> dsa_entry = (void *)(((char *) dsa_entry) + sizeof(dsa_pointer)); > Fixed.. > >> >> Similar stuff needs to be taken care at other places in the patch as >> well. I think it will be better if you run pgindent on your patch. > > I have run the pgindent, and taken all the changes whichever was > applicable for added code. > > There are some cleanup for "hash-support-alloc-free" also, so > attaching a new patch for this as well. > > > -- > Regards, > Dilip Kumar > EnterpriseDB: http://www.enterprisedb.com -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com hash-support-alloc-free-v5.patch Description: Binary data parallel-bitmap-heap-scan-v5.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Declarative partitioning - another take
On 2016/12/21 13:42, Robert Haas wrote:
> On Tue, Dec 20, 2016 at 9:14 PM, Amit Langote wrote:
>> On 2016/12/21 1:45, Alvaro Herrera wrote:
>>> Robert Haas wrote:
>>>> On Tue, Dec 20, 2016 at 10:27 AM, Alvaro Herrera wrote:
>>>>> Even if we decide to keep the message, I think it's not very good
>>>>> wording anyhow; as a translator I disliked it on sight. Instead of
>>>>> "skipping scan to validate" I would use "skipping validation scan",
>>>>> except that it's not clear what it is we're validating. Mentioning
>>>>> partition constraint in errcontext() doesn't look like a great
>>>>> solution, but I can't think of anything better.
>>>>
>>>> Maybe something like: partition constraint for table \"%s\" is implied
>>>> by existing constraints
>>>
>>> Actually, shouldn't we emit a message if we *don't* skip the check?
>>
>> Scanning (aka, not skipping) to validate the partition constraint is the
>> default behavior, so a user would be expecting it anyway; IOW, they need
>> not be informed of it. But when ATExecAttachPartition's efforts to avoid
>> the scan by comparing the partition constraint against existing
>> constraints (which the user most probably deliberately added just for
>> this) succeed, that seems like a better piece of information to provide
>> the user with, IMHO. But then again, having a message printed before a
>> potentially long validation scan seems like something a user would like
>> to see, to know what it is that is going to take so long. Hmm.
>>
>> Anyway, what would the opposite of Robert's suggested message look like:
>> "scanning table \"%s\" to validate partition constraint"?
>
> Maybe: partition constraint for table \"%s\" is implied by existing
> constraints

OK, updated patch attached.
Thanks,
Amit

diff --git a/src/test/regress/expected/alter_table.out b/src/test/regress/expected/alter_table.out
index 2a324c0b49..62e18961d3 100644
--- a/src/test/regress/expected/alter_table.out
+++ b/src/test/regress/expected/alter_table.out
@@ -3179,7 +3179,7 @@ ALTER TABLE list_parted2 ATTACH PARTITION part_3_4 FOR VALUES IN (3, 4);
 ALTER TABLE list_parted2 DETACH PARTITION part_3_4;
 ALTER TABLE part_3_4 ALTER a SET NOT NULL;
 ALTER TABLE list_parted2 ATTACH PARTITION part_3_4 FOR VALUES IN (3, 4);
-INFO: skipping scan to validate partition constraint
+INFO: partition constraint for table "part_3_4" is implied by existing constraints
 -- check validation when attaching range partitions
 CREATE TABLE range_parted (
 	a int,
@@ -3204,7 +3204,7 @@ CREATE TABLE part2 (
 	b int NOT NULL CHECK (b >= 10 AND b < 18)
 );
 ALTER TABLE range_parted ATTACH PARTITION part2 FOR VALUES FROM (1, 10) TO (1, 20);
-INFO: skipping scan to validate partition constraint
+INFO: partition constraint for table "part2" is implied by existing constraints
 -- check that leaf partitions are scanned when attaching a partitioned
 -- table
 CREATE TABLE part_5 (
@@ -3219,7 +3219,7 @@ ERROR: partition constraint is violated by some row
 DELETE FROM part_5_a WHERE a NOT IN (3);
 ALTER TABLE part_5 ADD CONSTRAINT check_a CHECK (a IN (5)), ALTER a SET NOT NULL;
 ALTER TABLE list_parted2 ATTACH PARTITION part_5 FOR VALUES IN (5);
-INFO: skipping scan to validate partition constraint
+INFO: partition constraint for table "part_5" is implied by existing constraints
 -- check that the table being attached is not already a partition
 ALTER TABLE list_parted2 ATTACH PARTITION part_2 FOR VALUES IN (2);
 ERROR: "part_2" is already a partition
[HACKERS] simplehash vs. pgindent
In commit b30d3ea824c5ccba43d3e942704f20686e7dbab8, when Andres added the simplehash stuff, he also added SH_TYPE, SH_ITERATOR, and SH_STATUS to typedefs.list. When I subsequently updated typedefs.list from the buildfarm in acddbe221b084956a0efd6e4b6c6586e8fd994d7, it of course removed those entries since they are merely placeholders for real types, not anything that would ever appear in a symbol table. So now if you pgindent the simplehash stuff, it does stupid things. I think we should add a file src/tools/pgindent/typedefs.extra.list. Somehow teach pgindent to search whatever directory it searches for typedefs.list for typedefs.extra.list as well. Then we can put stuff in there that the buildfarm doesn't find, and pgindent can combine the files in load_typedefs(). That way we can still update typedefs.list straight from the buildfarm, but there's a place to put manual entries that we don't want to lose. We could also teach load_typedefs() to strip out lines that are empty or start with #, and then we'd be able to put comments in typedefs.extra.list explaining why each entry needs to be present there rather than relying on the normal mechanism. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
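(For what it's worth, the combining logic could look something like the sketch below. pgindent itself is a Perl script, so this Python version is only to pin down the proposed semantics: merge typedefs.list with an optional typedefs.extra.list, skipping blank lines and '#' comment lines:)

```python
import os

def load_typedefs(directory):
    """Combine typedefs.list with an optional typedefs.extra.list,
    ignoring blank lines and lines starting with '#'."""
    typedefs = set()
    for name in ("typedefs.list", "typedefs.extra.list"):
        path = os.path.join(directory, name)
        if not os.path.exists(path):
            continue  # the extra list is optional
        with open(path) as f:
            for line in f:
                line = line.strip()
                if line and not line.startswith("#"):
                    typedefs.add(line)
    return sorted(typedefs)
```

typedefs.list can then keep being regenerated straight from the buildfarm, while the extra file holds the manually maintained entries (with comments explaining each one).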
Re: [HACKERS] Declarative partitioning - another take
On Tue, Dec 20, 2016 at 9:14 PM, Amit Langote wrote:
> On 2016/12/21 1:45, Alvaro Herrera wrote:
>> Robert Haas wrote:
>>> On Tue, Dec 20, 2016 at 10:27 AM, Alvaro Herrera wrote:
>>>> Even if we decide to keep the message, I think it's not very good
>>>> wording anyhow; as a translator I disliked it on sight. Instead of
>>>> "skipping scan to validate" I would use "skipping validation scan",
>>>> except that it's not clear what it is we're validating. Mentioning
>>>> partition constraint in errcontext() doesn't look like a great
>>>> solution, but I can't think of anything better.
>>>
>>> Maybe something like: partition constraint for table \"%s\" is implied
>>> by existing constraints
>>
>> Actually, shouldn't we emit a message if we *don't* skip the check?
>
> Scanning (aka, not skipping) to validate the partition constraint is the
> default behavior, so a user would be expecting it anyway; IOW, they need
> not be informed of it. But when ATExecAttachPartition's efforts to avoid
> the scan by comparing the partition constraint against existing
> constraints (which the user most probably deliberately added just for
> this) succeed, that seems like a better piece of information to provide
> the user with, IMHO. But then again, having a message printed before a
> potentially long validation scan seems like something a user would like
> to see, to know what it is that is going to take so long. Hmm.
>
> Anyway, what would the opposite of Robert's suggested message look like:
> "scanning table \"%s\" to validate partition constraint"?

Maybe: partition constraint for table \"%s\" is implied by existing
constraints

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] pgstattuple documentation clarification
On Tue, Dec 20, 2016 at 10:01 AM, Tom Lane wrote: > Andrew Dunstan writes: >> Recently a client was confused because there was a substantial >> difference between the reported table_len of a table and the sum of the >> corresponding tuple_len, dead_tuple_len and free_space. The docs are >> fairly silent on this point, and I agree that in the absence of >> explanation it is confusing, so I propose that we add a clarification >> note along the lines of: > >> The table_len will always be greater than the sum of the tuple_len, >> dead_tuple_len and free_space. The difference is accounted for by >> page overhead and space that is not free but cannot be attributed to >> any particular tuple. > >> Or perhaps we should be more explicit and refer to the item pointers on >> the page. > > I find "not free but cannot be attributed to any particular tuple" > to be entirely useless weasel wording, not to mention wrong with > respect to item pointers in particular. > > Perhaps we should start counting the item pointers in tuple_len. > We'd still have to explain about page header overhead, but that > would be a pretty small and fixed-size discrepancy. It's pretty weird to count unused or dead line pointers as part of tuple_len, and it would screw things up for anybody trying to calculate the average width of their tuples, which is an entirely reasonable thing to want to do. I think if we're going to count item pointers as anything, it needs to be some new category -- either item pointers specifically, or an "other stuff" bucket. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
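(To put rough numbers on the discrepancy being discussed: assuming the standard 8 kB block size, a 24-byte page header, and 4-byte line pointers, the space that the tuple_len/dead_tuple_len/free_space sums will never cover can be modeled as below. This is a sketch only; it ignores alignment padding and any special space:)

```python
PAGE_SIZE = 8192      # BLCKSZ
PAGE_HEADER = 24      # sizeof(PageHeaderData)
ITEM_POINTER = 4      # sizeof(ItemIdData), one per line pointer

def unaccounted(n_pages, n_line_pointers):
    # table_len - (tuple_len + dead_tuple_len + free_space),
    # under the simplifying assumptions above
    return n_pages * PAGE_HEADER + n_line_pointers * ITEM_POINTER

# e.g. a 1000-page table averaging 60 line pointers per page leaves
# 1000*24 + 60000*4 = 264000 bytes unaccounted for
```

That per-page-header part is the "small and fixed-size discrepancy"; the line-pointer part scales with tuple count, which is why lumping it into tuple_len would distort average-width calculations.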
Re: [HACKERS] Proposal : Parallel Merge Join
On Tue, Dec 13, 2016 at 8:34 PM, Dilip Kumar wrote:
> I have created two patches as per the suggestion.
>
> 1. mergejoin_refactoring_v2.patch --> Moves the functionality of
> considering various merge join paths into a new function.
> 2. parallel_mergejoin_v2.patch --> Adds the functionality of
> supporting partial mergejoin paths. This applies on top of
> mergejoin_refactoring_v2.patch.

We have done further analysis of the performance with the TPC-H benchmark at a higher scale factor. I have tested the parallel merge join patch along with parallel index scan [1]. I have observed that with query 3 we are getting linear scalability ('explain analyze' results are attached).

Test Setup:
---
TPCH 300 scale factor
work_mem = 1GB
shared_buffer = 1GB
random_page_cost=seq_page_cost=0.1
max_parallel_workers_per_gather=4
(warm cache ensured)

The median of 3 runs (readings are quite stable):
On Head: 2702568.099 ms
With Patch: 547363.164 ms

Other Experiments:
* I have also verified readings on head without modifying random_page_cost/seq_page_cost, but there is no change in plan or execution time.
* I have tried increasing max_parallel_workers_per_gather to 8, but I did not observe further scaling.

[1] https://www.postgresql.org/message-id/CAA4eK1KA4LwTYXZG%3Dk3J2bA%2BZOEYnz2gkyWBKgY%3D_q0xJRBMDw%40mail.gmail.com

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

3_head.out Description: Binary data
3_patch.out Description: Binary data
Re: [HACKERS] Hang in pldebugger after git commit : 98a64d0
On Tue, Dec 20, 2016 at 8:04 PM, Robert Haas wrote:
> On Sat, Dec 17, 2016 at 5:46 AM, Amit Kapila wrote:
>> Yeah, but we are planning to add a generic flag in WaitEvent structure which can be used for similar purpose. However, as per your suggestion, I have changed it only for Win32 port. Also, I kept the change in ModifyWaitEvent API as that seems to be okay considering 'reset' is a generic flag in WaitEvent structure.
>
> Well, we don't really need the change in ModifyWaitEvent if we have the change in WaitEventSetWaitBlock, right?

Yes.

> I'd be inclined to ditch the former and keep the latter.

Okay, I think you want to keep the change just for Win32 which seems fair as we haven't noticed such problem on any other platform.

> Also, this doesn't look right:
>
> +   for (cur_event = set->events;
> +        cur_event < (set->events + set->nevents)
> +        && returned_events < nevents;
> +        cur_event++)
> +   {
> +       if (cur_event->reset)
> +       {
> +           WaitEventAdjustWin32(set, cur_event);
> +           cur_event->reset = false;
> +       }
> +   }
>
> There's no need to include returned_events < nevents in the loop condition here because returned_events isn't changing.

Fixed.

> I think I'd also guard the reset flag with #ifdef WIN32. If it's not properly supported elsewhere it's just a foot-gun, and there seems to be no reason to write the code to properly support it elsewhere, whatever that would mean.

Okay. Ashutosh Sharma has helped to test that the pldebugger issue is fixed with the attached version.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

reset_wait_events_v4.patch Description: Binary data

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bigint vs txid user confusion
On 17 December 2016 at 00:13, Robert Haas wrote:
> On Thu, Dec 15, 2016 at 3:02 AM, Craig Ringer wrote:
>> I really wish we could just change the pg_stat_activity and pg_stat_replication xid fields to be epoch qualified in a 64-bit wide 'fullxid' type, or similar.
>
> I think that approach is worth considering.

I'm worried about how many monitoring tools it'd break. Arguably most or all are already broken and don't know it, though. Those that aren't will be using age(xid), which we can support just fine for the new type too, so they won't notice.

Ideally I'd just like to widen 'xid' to 64 bits. That would upset clients that use binary mode and "know" it's a 32-bit type on the wire, though, so I'm not sure it's a good idea even though it'd be nicer in pretty much every other way.

So the idea then is to add a new bigxid type, a 64-bit wide epoch-extended xid, and replace _all_ appearances of 'xid' with it in catalogs, views, etc, such that 'xid' is entirely deprecated. Rather than convert 64-bit extended XIDs to 32-bit for internal comparisons/functions/operators, we'll just epoch-extend our 32-bit xids, getting rid of the need to handle any sort of "too old" concept in most cases.

The return type of txid_current() should probably change to the new bigxid type. This'll upset apps that expect to do maths on it, since the new bigxid type won't have many operators, but since most (all?) of that maths will be wrong I don't think that's a bad thing. Banging in a ::bigint is a quick fix, so adaptation is easy, and it'll highlight incorrect uses (many) of the call. The type is on-wire compatible with the current bigint return type until you hit epoch 2^31, at which point you have bigger things to worry about, like pending epoch wraparound.

HOWEVER, if we're going to really remove 'xid' from user view, it should vanish from pg_database and pg_class too. That's a LOT more intrusive, and widens pg_class by 4 bytes per row for minimal gain.
No entry there can ever have an epoch older than the current epoch - 1, and even then only for the part that's past the wraparound threshold. pg_class is a raw relation, so we can't transform what the user sees via a view or function. Tools look at pg_class.relfrozenxid and pg_database.datfrozenxid more than probably anything else, so making changes that ignore them is pretty pointless. Everything else looks easy and minimally intrusive, but I'm not sure there's a sensible answer to pg_class.relfrozenxid.

test=> select table_name, column_name from information_schema.columns where data_type = 'xid';
      table_name      |  column_name
----------------------+---------------
 pg_class             | relfrozenxid
 pg_class             | relminmxid
 pg_database          | datfrozenxid
 pg_database          | datminmxid
 pg_locks             | transactionid
 pg_prepared_xacts    | transaction
 pg_stat_activity     | backend_xid
 pg_stat_activity     | backend_xmin
 pg_stat_replication  | backend_xmin
 pg_replication_slots | xmin
 pg_replication_slots | catalog_xmin
(11 rows)

test=> SELECT proname FROM pg_proc WHERE prorettype = 'xid'::regtype OR 'xid'::regtype = ANY (proargtypes);
         proname
--------------------------
 xidin
 xidout
 xideq
 age
 mxid_age
 xideqint4
 pg_get_multixact_members
 pg_xact_commit_timestamp
 xidrecv
 xidsend
(10 rows)

test=> SELECT oprname FROM pg_operator WHERE 'xid'::regtype IN (oprleft, oprright, oprresult);
 oprname
---------
 =
 =
(2 rows)

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
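The "epoch-extend our 32-bit xids" idea discussed above can be sketched in a few lines. This is an illustrative model only (helper names are made up), assuming the usual rule that any xid we care about lies within 2^31 transactions of the next full xid:

```python
XID_MASK = 0xFFFFFFFF

def full_xid(epoch, xid):
    # hypothetical helper: combine a 32-bit epoch with a 32-bit xid
    return (epoch << 32) | xid

def epoch_extend(xid, next_full_xid):
    """Widen a bare 32-bit xid to 64 bits relative to the next full xid.
    If the xid's low 32 bits are ahead of the next xid's, it must belong
    to the previous epoch."""
    epoch = next_full_xid >> 32
    if xid > (next_full_xid & XID_MASK):
        epoch -= 1
    return (epoch << 32) | xid
```

Once both sides are widened this way, plain 64-bit integer comparison replaces the modulo-2^31 wraparound logic, which is the point of the proposal.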
[HACKERS] Faster methods for getting SPI results
I've been looking at the performance of SPI calls within plpython. There's a roughly 1.5x difference from equivalent python code just in pulling data out of the SPI tuplestore. Some of that is due to an inefficiency in how plpython is creating result dictionaries, but fixing that is ultimately a dead-end: if you're dealing with a lot of results in python, you want a tuple of arrays, not an array of tuples. While we could just brute-force a tuple of arrays by plowing through the SPI tuplestore (this is what pl/r does), there's still a lot of extra work involved in doing that. AFAICT there's at least 2 copies that happen between the executor producing a tuple and it making it into the tuplestore, plus the tuplestore is going to consume a potentially very large amount of memory for a very short period of time, before all the data gets duplicated (again) into python objects. It would be a lot more efficient if we could just grab datums from the executor and make a single copy into plpython (or R), letting the PL deal with all the memory management overhead. I briefly looked at using SPI cursors to do just that, but that looks even worse: every fetch is executed in a subtransaction, and every fetch creates an entire tuplestore even if it's just going to return a single value. (But hey, we never claimed cursors were fast...) Is there any way to avoid all of this? I'm guessing one issue might be that we don't want to call an external interpreter while potentially holding page pins, but even then couldn't we just copy a single tuple at a time and save a huge amount of palloc overhead? -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com 855-TREBLE2 (855-873-2532) -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
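The "tuple of arrays, not an array of tuples" point can be illustrated with a small transpose. This is a pure-Python sketch of the target layout, not anything plpython currently exposes:

```python
def rows_to_columns(rows, colnames):
    """Transpose an array-of-tuples result set into a tuple-of-arrays
    layout, the form the post argues a PL should hand back for bulk
    results so each column becomes one contiguous array."""
    cols = {name: [] for name in colnames}
    for row in rows:
        for name, val in zip(colnames, row):
            cols[name].append(val)
    return cols
```

For example, rows_to_columns([(1, 'a'), (2, 'b')], ['id', 'val']) yields {'id': [1, 2], 'val': ['a', 'b']} — two column arrays rather than two row dictionaries.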
Re: [HACKERS] pg_background contrib module proposal
On Tue, Dec 20, 2016 at 7:32 PM, Simon Riggs wrote: > On 9 December 2016 at 13:00, Robert Haas wrote: >> That might be good, because then we wouldn't have to maintain two >> copies of the code. > > So why are there two things at all? Why is this being worked on as > well as Peter's patch? What will that give us? A feature that can be accessed without writing C code. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Tue, Dec 20, 2016 at 8:14 PM, Peter Geoghegan wrote: > Without meaning to sound glib, unification is the process by which > parallel CREATE INDEX has the leader read temp files from workers > sufficient to complete its final on-the-fly merge. That's not glib, but you can't in the end define BufFile unification in terms of what parallel CREATE INDEX needs. Whatever changes we make to lower-level abstractions in the service of some higher-level goal need to be explainable on their own terms. >>It's fine to make up new words -- indeed, in some sense that is the essence >> of writing any complex problem -- but you have to define them. > > I invite you to help me define this new word. If at some point I'm able to understand what it means, I'll try to do that. I think you're loosely using "unification" to mean combining stuff from different backends in some way that depends on the particular context, so that "BufFile unification" can be different from "LogicalTape unification". But that's just punting the question of what each of those things actually are. > Maybe it's inferior to that, but I think what Heikki proposes is more > or less complementary to what I've proposed, and has nothing to do > with resource management and plenty to do with making the logtape.c > interface look nice, AFAICT. It's also about refactoring/simplifying > logtape.c itself, while we're at it. I believe that Heikki has yet to > comment either way on my approach to resource management, one aspect > of the patch that I was particularly keen on your looking into. My reading of Heikki's point was that there's not much point in touching the BufFile level of things if we can do all of the necessary stuff at the LogicalTape level, and I agree with him about that. If a shared BufFile had a shared read-write pointer, that would be a good justification for having it. 
But it seems like unification at the BufFile level is just concatenation, and that can be done just as well at the LogicalTape level, so why tinker with BufFile? As I've said, I think there's some low-level hacking needed here to make sure files get removed at the correct time in all cases, but apart from that I see no good reason to push the concatenation operation all the way down into BufFile. > The theory of operation here is that workers own their own BufFiles, > and are responsible for deleting them when they die. The assumption, > rightly or wrongly, is that it's sufficient that workers flush > everything out (write out temp files), and yield control to the > leader, which will open their temp files for the duration of the > leader final on-the-fly merge. The resource manager in the leader > knows it isn't supposed to ever delete worker-owned files (just > close() the FDs), and the leader errors if it cannot find temp files > that match what it expects. If there is a an error in the leader, it > shuts down workers, and they clean up, more than likely. If there is > an error in the worker, or if the files cannot be deleted (e.g., if > there is a classic hard crash scenario), we should also be okay, > because nobody will trip up on some old temp file from some worker, > since fd.c has some gumption about what workers need to do (and what > the leader needs to avoid) in the event of a hard crash. I don't see a > risk of file descriptor leaks, which may or may not have been part of > your concern (please clarify). I don't think there's any new issue with file descriptor leaks here, but I think there is a risk of calling unlink() too early or too late with your design. My proposal was an effort to nail that down real tight. >> If the worker is always completely finished with the tape before the >> leader touches it, couldn't the leader's LogicalTapeSet just "adopt" >> the tape and overwrite it like any other? 
> > I'll remind you that parallel CREATE INDEX doesn't actually ever need > to be randomAccess, and so we are not actually going to ever need to > do this as things stand. I wrote the code that way in order to not > break the existing interface, which seemed like a blocker to posting > the patch. I am open to the idea of such an "adoption" occurring, even > though it actually wouldn't help any case that exists in the patch as > proposed. I didn't go that far in part because it seemed premature, > given that nobody had looked at my work to date at the time, and given > the fact that there'd be no initial user-visible benefit, and given > how the exact meaning of "unification" was (and is) somewhat in flux. > > I see no good reason to not do that, although that might change if I > actually seriously undertook to teach the leader about this kind of > "adoption". I suspect that the interface specification would make for > confusing reading, which isn't terribly appealing, but I'm sure I > could manage to make it work given time. I think the interface is pretty clear: the worker's logical tapes get incorporated into the leader's LogicalTapeSet as if they'd bee
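As a rough illustration of "unification is just concatenation" at the logical-tape level, here is a toy model (all names invented, nothing here mirrors logtape.c's actual structures): each worker tape's local block numbers are mapped into one shared space by adding the total size of all preceding tapes.

```python
class UnifiedTapes:
    """Toy sketch: concatenate worker tapes into one block-number space.
    Tape i's local block b maps to offset(i) + b, where offset(i) is the
    combined size of tapes 0..i-1."""
    def __init__(self, tape_sizes):
        self.offsets = []
        total = 0
        for size in tape_sizes:
            self.offsets.append(total)
            total += size
        self.total_blocks = total

    def global_block(self, tape_no, local_block):
        return self.offsets[tape_no] + local_block
```

The leader's final merge can then read any worker tape through the unified numbering without the workers' files being physically combined.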
Re: [HACKERS] BUG: pg_stat_statements query normalization issues with combined queries
On Tue, Dec 20, 2016 at 6:18 AM, Fabien COELHO wrote: > Would this approach be acceptable, or is modifying Nodes a no-go area? > > If it is acceptable, I can probably put together a patch and submit it. > > If not, I suggest to update the documentation to tell that > pg_stat_statements does not work properly with combined queries. I think you've found a bug, but I'm a little doubtful about your proposed fix. However, I haven't studied the code, so I don't know what other approach might be better. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Logical decoding on standby
On 20 December 2016 at 15:03, Petr Jelinek wrote: > But in 0003 I don't understand following code: >> + if (endpos != InvalidXLogRecPtr && !do_start_slot) >> + { >> + fprintf(stderr, >> + _("%s: cannot use --create-slot or --drop-slot >> together with --endpos\n"), >> + progname); >> + fprintf(stderr, _("Try \"%s --help\" for more information.\n"), >> + progname); >> + exit(1); >> + } > > Why is --create-slot and --endpos not allowed together? Actually, the test is fine, the error is just misleading due to my misunderstanding. The fix is simply to change the error message to _("%s: --endpos may only be specified with --start\n"), so I won't post a separate followup patch. Okano Naoki tried to bring this to my attention earlier, but I didn't understand as I hadn't yet realised you could use --create-slot --start, they weren't mutually exclusive. (We test to ensure --start --drop-slot isn't specified earlier so no test for do_drop_slot is required). -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Why does plpython delay composite type resolution?
Why do functions that accept composite types delay type resolution until execution? I have a naive patch that speeds up plpy.execute() by 8% by caching interned Python strings for the dictionary key names (which are repeated over and over). The next step is to pre-allocate those strings as appropriate for the calling context, but it's not clear how to handle that for input arguments.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
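The caching trick described here has a simple pure-Python analogue. The real patch does this at the C level with interned PyUnicode key objects; this sketch only shows the idea of creating the key strings once and reusing them for every row:

```python
import sys

def make_row_builder(colnames):
    """Intern the column-name keys once, then reuse the same string
    objects for every result-row dictionary instead of re-creating
    the keys per row."""
    keys = [sys.intern(name) for name in colnames]

    def build(row):
        return dict(zip(keys, row))

    return build
```

Since dictionary insertion with interned keys can compare by pointer identity, repeated rows avoid both string allocation and full hash-key comparison.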
Re: [HACKERS] Logical decoding on standby
On 20 December 2016 at 15:03, Petr Jelinek wrote: >> The biggest change in this patch, and the main intrusive part, is that >> procArray->replication_slot_catalog_xmin is no longer directly used by >> vacuum. Instead, a new ShmemVariableCache->oldestCatalogXmin field is >> added, with a corresponding CheckPoint field. >> [snip] > > If this mechanism would not be needed most of the time, wouldn't it be > better to not have it and just have a way to ask physical slot about > what's the current reserved catalog_xmin (in most cases the standby > should actually know what it is since it's sending the hs_feedback, but > first slot created on such standby may need to check). Yes, and that was actually my originally preferred approach, though the one above does offer the advantage that if something goes wrong we can detect it and fail gracefully. Possibly not worth the complexity though. Your approach requires us to make very sure that hot_standby_feedback does not get turned off by user or become ineffective once we're replicating, though, since we won't have any way to detect when needed tuples are removed. We'd probably just bail out with relcache/syscache lookup errors, but I can't guarantee we wouldn't crash if we tried logical decoding on WAL where needed catalogs have been removed. I initially ran into trouble doing that, but now think I have a way forward. > WRT preventing > hs_feedback going off, we can refuse to start with hs_feedback off when > there are logical slots detected. Yes. There are some ordering issues there though. We load slots quite late in startup and they don't get tracked in checkpoints. So we might launch the walreceiver before we load slots and notice their needed xmin/catalog_xmin. So we need to prevent sending of hot_standby_feedback until slots are loaded, or load slots earlier in startup. 
The former sounds less intrusive and safer - probably just add an "initialized" flag to ReplicationSlotCtlData and suppress hs_feedback until it becomes true. We'd also have to suppress the validation callback action on the hot_standby_feedback GUC until we know replication slot state is initialised, and perform the check during slot startup instead. The hot_standby_feedback GUC validation check might get called before shmem is even set up so we have to guard against attempts to access a shmem segment that may not event exist yet. The general idea is workable though. Refuse to start if logical slots exist and hot_standby_feedback is off or walreceiver isn't using a physical slot. Refuse to allow hot_standby_feedback to change > We can also refuse to connect to the > master without physical slot if there are logical slots detected. I > don't see problem with either of those. Agreed. We must also be able to reliably enforce that the walreceiver is using a replication slot to connect to the master and refuse to start if it is not. The user could change recovery.conf and restart the walreceiver while we're running, after we perform that check, so walreceiver must also refuse to start if logical replication slots exist but it has no primary slot name configured. > You may ask what if user drops the slot and recreates or somehow > otherwise messes up catalog_xmin on master, well, considering that under > my proposal we'd first (on connect) check the slot for catalog_xmin we'd > know about it so we'd either mark the local slots broken/drop them or > plainly refuse to connect to the master same way as if it didn't have > required WAL anymore (not sure which behavior is better). Note that user > could mess up catalog_xmin even in your design the same way, so it's not > really a regression. Agreed. Checking catalog_xmin of the slot when we connect is sufficient to guard against that, assuming we can trust that the catalog_xmin is actually in effect on the master. 
Consider cascading setups, where we set our catalog_xmin but it might not be "locked in" until the middle cascaded server relays it to the master. I have a proposed solution to that which I'll outline in a separate patch+post; it ties in to some work on addressing the race between hot standby feedback taking effect and queries starting on the hot standby. It boils down to "add a hot_standby_feedback reply protocol message". > Plus > it might even be okay to only allow creating logical slots on standbys > connected directly to master in v1. True. I didn't consider that. We haven't had much luck in the past with such limitations, but personally I'd consider it a perfectly reasonable one. > But in 0003 I don't understand following code: >> + if (endpos != InvalidXLogRecPtr && !do_start_slot) >> + { >> + fprintf(stderr, >> + _("%s: cannot use --create-slot or --drop-slot >> together with --endpos\n"), >> + progname); >> + fprintf(stderr, _("Try \"%s --help\" for more information.\n"), >> + progname); >> +
Re: [HACKERS] Replication slot xmin is not reset if HS feedback is turned off while standby is shut down
On Wed, Dec 21, 2016 at 11:14 AM, Craig Ringer wrote: > I didn't see this in the CF app so I created it in > https://commitfest.postgresql.org/12/ as > https://commitfest.postgresql.org/12/924/ . I couldn't find your > PostgreSQL community username, so please log in and set yourself as > author. I have set patch state to "waiting on author" pending your > addressing these remarks. You need to import the user first, which is what I did and I have added the author name correctly. (Glad you jumped on this patch, I was just going to begin a lookup. So that's a nice timing.) -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Declarative partitioning - another take
On 2016/12/21 1:45, Alvaro Herrera wrote:
> Robert Haas wrote:
>> On Tue, Dec 20, 2016 at 10:27 AM, Alvaro Herrera wrote:
>>> Even if we decide to keep the message, I think it's not very good wording anyhow; as a translator I disliked it on sight. Instead of "skipping scan to validate" I would use "skipping validation scan", except that it's not clear what it is we're validating. Mentioning the partition constraint in errcontext() doesn't seem like a great solution, but I can't think of anything better.
>>
>> Maybe something like: partition constraint for table \"%s\" is implied by existing constraints
>
> Actually, shouldn't we emit a message if we *don't* skip the check?

Scanning (aka not skipping) to validate the partition constraint is the default behavior, so a user would be expecting it anyway; IOW, they need not be informed of it. But when ATExecAttachPartition's efforts to avoid the scan, by comparing the partition constraint against existing constraints (which the user most probably deliberately added just for this), succeed, that seems like a better piece of information to provide the user with, IMHO.

But then again, having a message printed before a potentially long validation scan seems like something a user would like to see, to know what it is that is going to take so long. Hmm.

Anyway, what would the opposite of Robert's suggested message look like: "scanning table \"%s\" to validate partition constraint"?

Thanks,
Amit

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Replication slot xmin is not reset if HS feedback is turned off while standby is shut down
On 21 December 2016 at 00:52, Ants Aasma wrote: > The simple fix seems to be to always send out at least one feedback > message on each connect regardless of hot_standby_feedback setting. I agree. > Patch attached. Looks like this goes back to version 9.4. It could > conceivably cause issues for replication middleware that does not know > how to handle hot standby feedback messages. Not sure if any exist and > if that is a concern. If it exists it's already broken. Not our problem IMO. > A shell script to reproduce the problem is also attached, adjust the > PGPATH variable to your postgres install and run in an empty > directory. I wrote some TAP tests for 9.6 (they're on the logical decoding on standby thread) that can be extended to check this. Once they're committed we can add a test for this change. Re the patch, I don't like - static bool master_has_standby_xmin = false; + static bool master_has_standby_xmin = true; without any comment. It's addressed in the comment changes on the header func, but the link isn't obvious. Maybe a oneliner to say "ensure we always send at least one feedback message" ? I think this part of the patch is correct and useful. I don't see the point of +* +* If Hot Standby is not yet active we reset the xmin value. Check this +* after the interval has expired to reduce number of calls. */ - if (hot_standby_feedback) + if (hot_standby_feedback && HotStandbyActive()) though. Forcing feedback to send InvalidTransactionId until hot standby feedback is actively running seems counterproductive; we want to lock in feedback as soon as possible, not wait until we're accepting transactions. Simon and I are in fact working on changes to do the opposite of what you've got here and ensure that we don't allow hot standby connections until we know hot_standby_feedback is in effect and a usable xmin is locked in. That way the master won't remove tuples we need as soon as we start an xact and cause a conflict with recovery. 
If there are no running xacts, GetOldestXmin() will return the slot xmin, if any. We should definitely not be clearing that just because we're not accepting hot standby connections yet; we want to make very sure it remains in effect unless the user explicitly de-configures hot standby.

(There's another pre-existing problem there, where we can start the walsender before slots are initialized, that I'll be addressing separately.)

If there's no slot xmin either, we'll send nextXid. That's a sensible starting point for hot standby feedback until we start some transactions.

So -1 on this part of the patch, unless there's something I've misunderstood.

I didn't see this in the CF app so I created it in https://commitfest.postgresql.org/12/ as https://commitfest.postgresql.org/12/924/ . I couldn't find your PostgreSQL community username, so please log in and set yourself as author. I have set the patch state to "waiting on author" pending your addressing these remarks.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
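The fix under discussion — initialize master_has_standby_xmin to true so that at least one feedback message always goes out after (re)connect — can be modeled in miniature. This is a toy state machine, not the walreceiver code:

```python
class FeedbackSender:
    """Toy model of the proposed behavior: starting with
    master_has_standby_xmin = True forces one feedback message
    (possibly carrying InvalidTransactionId) after each reconnect,
    so a disabled hot_standby_feedback still resets the slot xmin."""
    INVALID_XID = 0

    def __init__(self, hot_standby_feedback):
        self.hot_standby_feedback = hot_standby_feedback
        self.master_has_standby_xmin = True   # force one initial message
        self.sent = []

    def maybe_send_feedback(self, xmin):
        if not self.hot_standby_feedback:
            xmin = self.INVALID_XID
        if xmin == self.INVALID_XID and not self.master_has_standby_xmin:
            return                            # nothing left to reset
        self.sent.append(xmin)
        self.master_has_standby_xmin = (xmin != self.INVALID_XID)
```

With feedback off, exactly one InvalidTransactionId message is sent and then suppressed thereafter; with feedback on, the xmin is relayed as usual.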
Re: pg_authid.rolpassword format (was Re: [HACKERS] Password identifiers, protocol aging and SCRAM protocol)
On Tue, Dec 20, 2016 at 9:23 PM, Heikki Linnakangas wrote: > On 12/16/2016 03:31 AM, Michael Paquier wrote: > Actually, it does still perform that check. There's a new function, > plain_crypt_verify, that passwordcheck uses now. plain_crypt_verify() is > intended to work with any future hash formats we might introduce in the > future (including SCRAM), so that passwordcheck doesn't need to know about > all the hash formats. Bah. I have misread the first version of the patch, and it is indeed keeping the username checks. Now that things don't crash that behaves as expected: =# load 'passwordcheck'; LOAD =# alter role mpaquier password 'mpaquier'; ERROR: 22023: password must not contain user name LOCATION: check_password, passwordcheck.c:101 =# alter role mpaquier password 'md58349d3a1bc8f4f7399b1ff9dea493b15'; ERROR: 22023: password must not contain user name LOCATION: check_password, passwordcheck.c:82 With the patch: >> +case PASSWORD_TYPE_PLAINTEXT: >> +shadow_pass = &shadow_pass[strlen("plain:")]; >> +break; >> It would be a good idea to have a generic routine able to get the plain >> password value. In short I think that we should reduce the amount of >> locations where "plain:" prefix is hardcoded. > > There is such a function included in the patch, get_plain_password(char > *shadow_pass), actually. Contrib/passwordcheck uses it. I figured that in > crypt.c itself, it's OK to do the above directly, but get_plain_password() > is intended to be used elsewhere. The idea would be to have the function not return an allocated string, just a position to it. That would be useful in plain_crypt_verify() for example, for a total of 4 places, including get_plain_password() where the new string allocation is done. Well, it's not like this prefix "plain:" would change anyway in the future nor that it is going to spread much. > Thanks for having a look! Attached is a new version, with that bug fixed. 
I have been able more advanced testing without the crash and things seem to work properly. The attached set of tests is also able to pass for all the combinations of hba configurations and password formats. And looking at the code I don't have more comments. -- Michael 009_authentication.pl Description: Perl program -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Quorum commit for multiple synchronous replication.
At Tue, 20 Dec 2016 23:47:22 +0900, Fujii Masao wrote in
> On Tue, Dec 20, 2016 at 2:46 PM, Michael Paquier wrote:
>> On Tue, Dec 20, 2016 at 2:31 PM, Masahiko Sawada wrote:
>>> Do we need to consider the sorting method and the selecting k-th latest LSN method?
>>
>> Honestly, nah. Tests are showing that there are many more bottlenecks before that with just memory allocation and parsing.
>
> I think that it's worth prototyping an alternative algorithm and measuring the performance of the alternative and current algorithms. This measurement should check not only the bottleneck but also how much each algorithm increases the time that backends need to wait before they receive an ack from the walsender.
>
> If the current algorithm is reported to be efficient enough, we can just leave the code as it is. OTOH, if not, let's adopt the alternative one.

I'm personally interested in the difference between them, but it doesn't seem urgently required. If we have nothing particular to do with this, considering other ordering methods would be valuable. Though this is not a well-grounded thought, maintaining the top-k list by insertion seems more promising than running a top-k sort on the whole list. Sorting all walsenders is needed for the first time and in some other situations, though.

By the way, do we continue dispu^h^hcussing the format of s_s_names and/or a successor right now?

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
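The "maintain the top-k list by insertion" idea can be sketched as follows. Note this models a stream of LSNs, whereas the real quorum-commit code recomputes over the current set of walsenders, so it only illustrates the data structure being proposed:

```python
import bisect

class KthLatestLSN:
    """Keep only the k largest LSNs seen so far, in ascending order;
    the k-th latest is then the smallest retained entry. Each update is
    an O(k) insertion instead of a full sort of all walsender LSNs."""
    def __init__(self, k):
        self.k = k
        self.top = []          # ascending, at most k entries

    def add(self, lsn):
        bisect.insort(self.top, lsn)
        if len(self.top) > self.k:
            self.top.pop(0)    # drop the smallest, it can't be top-k

    def kth_latest(self):
        return self.top[0] if len(self.top) == self.k else None
```

For a quorum of k standbys, kth_latest() is the LSN that at least k walsenders have reached, which is what the backend must wait for.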
Re: [HACKERS] Protect syscache from bloating with negative cache entries
Thank you for the discussion.

At Tue, 20 Dec 2016 15:10:21 -0500, Tom Lane wrote in <23492.1482264...@sss.pgh.pa.us>
> The bigger picture here though is that we used to have limits on syscache size, and we got rid of them (commit 8b9bc234a, see also https://www.postgresql.org/message-id/flat/5141.1150327541%40sss.pgh.pa.us) not only because of the problem you mentioned about performance falling off a cliff once the working-set size exceeded the arbitrary limit, but also because enforcing the limit added significant overhead --- and did so whether or not you got any benefit from it, ie even if the limit is never reached. Maybe the present patch avoids imposing a pile of overhead in situations where no pruning is needed, but it doesn't really look very promising from that angle in a quick once-over.

Indeed. As mentioned in the mail at the beginning of this thread, it triggers a whole-cache scan if even one negative cache entry exists, even when that entry is unrelated to the target relid, and that can take significantly long on a fat cache. Lists of negative entries like CatCacheList would help, but need additional memory.

> BTW, I don't see the point of the second patch at all? Surely, if an object is deleted or updated, we already have code that flushes related catcache entries. Otherwise the caches would deliver wrong data.

Maybe you have misread the patch. Negative entries won't be flushed by any means. Deletion of a namespace causes cascaded object deletion according to dependencies, which finally results in invalidation of non-negative cache entries. But removal of *negative entries* in RELNAMENSP won't happen.

The test script for the case (gen2.pl) does the following:

CREATE SCHEMA foo;
SELECT * FROM foo.invalid;
DROP SCHEMA foo;

Removing the schema foo leaves a negative cache entry for 'foo.invalid' in RELNAMENSP. However, I'm not sure the above situation happens so frequently that it is worthwhile to amend.
regards, -- Kyotaro Horiguchi NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Tue, Dec 20, 2016 at 2:53 PM, Robert Haas wrote:
>> What I have in mind is something like the attached patch. It refactors
>> LogicalTapeRead(), LogicalTapeWrite() etc. functions to take a LogicalTape
>> as argument, instead of LogicalTapeSet and tape number. LogicalTapeSet
>> doesn't have the concept of a tape number anymore, it can contain any number
>> of tapes, and you can create more on the fly. With that, it'd be fairly easy
>> to make tuplesort.c merge LogicalTapes that came from different tape sets,
>> backed by different BufFiles. I think that'd avoid much of the unification
>> code.
>
> I just looked at the buffile.c/buffile.h changes in the latest version
> of the patch and I agree with this criticism, though maybe not with
> the proposed solution. I actually don't understand what "unification"
> is supposed to mean. The patch really doesn't explain that anywhere
> that I can see. It says stuff like:
>
> + * Parallel operations can use an interface to unify multiple worker-owned
> + * BufFiles and a leader-owned BufFile within a leader process. This relies
> + * on various fd.c conventions about the naming of temporary files.

Without meaning to sound glib, unification is the process by which parallel CREATE INDEX has the leader read temp files from workers sufficient to complete its final on-the-fly merge. So, it's a term that's a bit like "speculative insertion" was up until UPSERT was committed: a concept that is somewhat in flux, and describes a new low-level mechanism built to support a higher-level operation, which must accord with a higher-level set of requirements (so, for speculative insertion, that would be avoiding "unprincipled deadlocks" and so on). That being the case, maybe "unification" isn't useful as a precise piece of terminology at this point, but that will change.
While I'm fairly confident that I basically have the right idea with this patch, I think that you are better at judging the ins and outs of resource management than I am, not least because of your experience working on parallel query itself. Also, I'm signed up to review parallel hash join in large part because I think there might be some convergence concerning the sharing of BufFiles among parallel workers. I don't think I'm qualified to judge what a general abstraction like this should look like, but I'm trying to get there.

> That comment tells you that unification is a thing you can do -- via
> an unspecified interface for unspecified reasons using unspecified
> conventions -- but it doesn't tell you what the semantics of it are
> supposed to be. For example, if we "unify" several BufFiles, do they
> then have a shared seek pointer?

No.

> Do the existing contents effectively
> get concatenated in an unpredictable order, or are they all expected
> to be empty at the time unification happens? Or something else?

The order is the same order in which ordinal identifiers are assigned to workers within tuplesort.c, which is undefined, with the notable exception of the leader's own identifier (-1) and area of the unified BufFile space (this is only relevant in randomAccess cases, where the leader may write stuff out to its own reserved part of the BufFile space). It only matters that the bit of metadata in shared memory is in that same order, which it clearly will be. So, it's unpredictable, but in the same way that ordinal identifiers are assigned in a not-well-defined order; it doesn't, or at least shouldn't, matter. We can imagine a case where it does matter, and we probably should, but that case isn't parallel CREATE INDEX.

> It's fine to make up new words -- indeed, in some sense that is the essence
> of writing about any complex problem -- but you have to define them.

I invite you to help me define this new word.
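Peter's point that the registration order of worker tapes shouldn't matter can be illustrated with a toy final merge (a Python sketch using heapq, not tuplesort.c's actual code): merging sorted worker runs yields the same output no matter what order the runs are presented in.

```python
import heapq

def leader_merge(worker_runs):
    # Each worker contributes one sorted run; heapq.merge streams them
    # into a single sorted sequence, like a final on-the-fly merge pass.
    return list(heapq.merge(*worker_runs))

# Made-up worker output for the example.
runs = [[1, 4, 7], [2, 5, 8], [3, 6, 9]]
forward = leader_merge(runs)
shuffled = leader_merge(runs[::-1])   # register workers in reverse order
# forward == shuffled: merge output is independent of registration order
```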
> It's hard to understand how something like this doesn't leak > resources. Maybe that's been thought about here, but it isn't very > clear to me how it's supposed to work. I agree that it would be useful to centrally document what all this unification stuff is about. Suggestions on where that should live are welcome. > In Heikki's proposal, if > process A is trying to read a file owned by process B, and process B > dies and removes the file before process A gets around to reading it, > we have got trouble, especially on Windows which apparently has low > tolerance for such things. Peter's proposal avoids that - I *think* - > by making the leader responsible for all resource cleanup, but that's > inferior to the design we've used for other sorts of shared resource > cleanup (DSM, DSA, shm_mq, lock groups) where the last process to > detach always takes responsibility. Maybe it's inferior to that, but I think what Heikki proposes is more or less complementary to what I've proposed, and has nothing to do with resource management and plenty to do with making the logtape.c interface look nice, AFAICT. It's also about refactoring/simplifying logtape.c itself, while we're at it. I believe that Heik
Re: pg_authid.rolpassword format (was Re: [HACKERS] Password identifiers, protocol aging and SCRAM protocol)
David,

* David Fetter (da...@fetter.org) wrote:
> On Tue, Dec 20, 2016 at 06:14:40PM -0500, Stephen Frost wrote:
> > * David Fetter (da...@fetter.org) wrote:
> > > On Tue, Dec 20, 2016 at 08:34:19AM -0500, Stephen Frost wrote:
> > > > * Heikki Linnakangas (hlinn...@iki.fi) wrote:
> > > > > Even if you have a separate "verifier type" column, it's not fully
> > > > > normalized, because there's still a dependency between the
> > > > > verifier and verifier type columns. You will always need to look
> > > > > at the verifier type to make sense of the verifier itself.
> > > >
> > > > That's true- but you don't need to look at the verifier, or even
> > > > have *access* to the verifier, to look at the verifier type.
> > >
> > > Would a view that shows only what's to the left of the first semicolon
> > > suit this purpose?
> >
> > Obviously a (security barrier...) view or a (security definer) function
> > could be used, but I don't believe either is actually a good idea.
>
> Would you be so kind as to help me understand what's wrong with that idea?

For starters, it doubles down on the assumption that we'll always be happy with that particular separator, and implies to anyone watching that they'll be able to trust it. Further, it's additional complication which, at least to my eyes, is entirely in the wrong direction. We could push everything in pg_authid into a single colon-separated text field and call it simpler because we don't have to deal with those silly column things, and we'd have something a lot closer to a unix passwd file too! But that wouldn't make it a terribly smart thing to do. We aren't a bunch of individual C programs having to parse things out of flat text files, after all.

Thanks!

Stephen

signature.asc Description: Digital signature
Re: [HACKERS] pg_background contrib module proposal
On 9 December 2016 at 13:00, Robert Haas wrote:
> That might be good, because then we wouldn't have to maintain two
> copies of the code.

So why are there two things at all? Why is this being worked on as well as Peter's patch? What will that give us?

-- 
Simon Riggs  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] pg_background contrib module proposal
On Mon, Dec 19, 2016 at 09:30:32PM +0500, Andrew Borodin wrote:
> 2016-12-19 4:21 GMT+05:00 David Fetter :
> > Couldn't it sleep in increments smaller than a second? Like maybe
> > milliseconds? Also, it's probably cleaner (or at least more
> > comprehensible) to write something using format() and dollar quoting
> > than the line with the double 's.
>
> Right. Here's a version which waits for half a second. I do not see how
> to pass doubles via dollar quoting; pg_background_launch() gets a string
> as a parameter, it's not EXECUTE USING.

I see. I find the following a little easier to follow, and the sleeps still work even when very short.

Best,
David.

CREATE TABLE input AS
    SELECT x, row_number() OVER (ORDER BY x) n
    FROM generate_series(0, .05, 0.01) x
    ORDER BY x DESC;

CREATE TABLE output(place int, value float);

CREATE TABLE handles AS
    SELECT pg_background_launch(
        format($$
            SELECT pg_sleep(%s);
            INSERT INTO output VALUES (%s, %s)
        $$, x, n, x)
    ) h
    FROM input;

-- wait until everyone finishes
SELECT *
FROM handles
JOIN LATERAL pg_background_result(handles.h) AS (x TEXT) ON (true);

-- output results
SELECT * FROM output ORDER BY place;

-- 
David Fetter http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
Re: pg_authid.rolpassword format (was Re: [HACKERS] Password identifiers, protocol aging and SCRAM protocol)
On Tue, Dec 20, 2016 at 06:14:40PM -0500, Stephen Frost wrote: > David, > > * David Fetter (da...@fetter.org) wrote: > > On Tue, Dec 20, 2016 at 08:34:19AM -0500, Stephen Frost wrote: > > > * Heikki Linnakangas (hlinn...@iki.fi) wrote: > > > > Even if you have a separate "verifier type" column, it's not fully > > > > normalized, because there's still a dependency between the > > > > verifier and verifier type columns. You will always need to look > > > > at the verifier type to make sense of the verifier itself. > > > > > > That's true- but you don't need to look at the verifier, or even > > > have *access* to the verifier, to look at the verifier type. > > > > Would a view that shows only what's to the left of the first semicolon > > suit this purpose? > > Obviously a (security barrier...) view or a (security definer) function > could be used, but I don't believe either is actually a good idea. Would you be so kind as to help me understand what's wrong with that idea? Best, David. -- David Fetter http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david(dot)fetter(at)gmail(dot)com Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Protect syscache from bloating with negative cache entries
On Tue, Dec 20, 2016 at 3:10 PM, Tom Lane wrote: > Robert Haas writes: >> On Tue, Dec 20, 2016 at 10:09 AM, Tom Lane wrote: >>> I don't understand why we'd make that a system-wide behavior at all, >>> rather than expecting each process to manage its own cache. > >> Individual backends don't have a really great way to do time-based >> stuff, do they? I mean, yes, there is enable_timeout() and friends, >> but I think that requires quite a bit of bookkeeping. > > If I thought that "every ten minutes" was an ideal way to manage this, > I might worry about that, but it doesn't really sound promising at all. > Every so many queries would likely work better, or better yet make it > self-adaptive depending on how much is in the local syscache. I don't think "every so many queries" is very promising at all. First, it has the same problem as a fixed cap on the number of entries: if you're doing a round-robin just slightly bigger than that value, performance will be poor. Second, what's really important here is to keep the percentage of wall-clock time spent populating the system caches small. If a backend is doing 4000 queries/second and each of those 4000 queries touches a different table, it really needs a cache of at least 4000 entries or it will thrash and slow way down. But if it's doing a query every 10 minutes and those queries round-robin between 4000 different tables, it doesn't really need a 4000-entry cache. If those queries are long-running, the time to repopulate the cache will only be a tiny fraction of runtime. If the queries are short-running, then the effect is, percentage-wise, just the same as for the high-volume system, but in practice it isn't likely to be felt as much. I mean, if we keep a bunch of old cache entries around on a mostly-idle backend, they are going to be pushed out of CPU caches and maybe even paged out. One can't expect a backend that is woken up after a long sleep to be quite as snappy as one that's continuously active. 
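The "falling off a cliff" behavior described above is easy to reproduce with a toy model (a Python sketch with made-up sizes, not the actual syscache): a bounded LRU cache accessed round-robin over a working set just one entry larger than the cap misses on every single access, while an unbounded cache misses only once per table.

```python
from collections import OrderedDict

def run(cache_cap, n_tables, n_accesses):
    """Count cache misses for round-robin access over n_tables keys."""
    cache, misses = OrderedDict(), 0
    for i in range(n_accesses):
        key = i % n_tables               # round-robin over the tables
        if key in cache:
            cache.move_to_end(key)       # mark as most recently used
        else:
            misses += 1
            cache[key] = True
            if cache_cap is not None and len(cache) > cache_cap:
                cache.popitem(last=False)  # evict least-recently-used

    return misses

# Working set of 4001 tables vs. a 4000-entry cap: every access misses.
capped = run(cache_cap=4000, n_tables=4001, n_accesses=40010)
uncapped = run(cache_cap=None, n_tables=4001, n_accesses=40010)
# capped == 40010 (100% miss rate); uncapped == 4001 (one miss per table)
```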
Which gets to my third point: anything that's based on number of queries won't do anything to help the case where backends sometimes go idle and sit there for long periods. Reducing resource utilization in that case would be beneficial. Ideally I'd like to get rid of not only the backend-local cache contents but the backend itself, but that's a much harder project. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: pg_authid.rolpassword format (was Re: [HACKERS] Password identifiers, protocol aging and SCRAM protocol)
David, * David Fetter (da...@fetter.org) wrote: > On Tue, Dec 20, 2016 at 08:34:19AM -0500, Stephen Frost wrote: > > * Heikki Linnakangas (hlinn...@iki.fi) wrote: > > > Even if you have a separate "verifier type" column, it's not fully > > > normalized, because there's still a dependency between the > > > verifier and verifier type columns. You will always need to look > > > at the verifier type to make sense of the verifier itself. > > > > That's true- but you don't need to look at the verifier, or even > > have *access* to the verifier, to look at the verifier type. > > Would a view that shows only what's to the left of the first semicolon > suit this purpose? Obviously a (security barrier...) view or a (security definer) function could be used, but I don't believe either is actually a good idea. Thanks! Stephen signature.asc Description: Digital signature
Re: [HACKERS] too low cost of Bitmap index scan
Pavel Stehule writes: > I am trying to fix slow query on PostgreSQL 9.5.4. > The data are almost in RAM If it's all in RAM, you'd likely be well-served to lower random_page_cost. It looks to me like the planner is estimating pretty accurately how many heap fetches will be eliminated by using the extra index; where it's off seems to be in the cost of those heap fetches relative to the index work. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Make pg_basebackup -x stream the default
On Tue, Dec 20, 2016 at 10:38 PM, Fujii Masao wrote:
> On Mon, Dec 19, 2016 at 7:51 PM, Vladimir Rusinov wrote:
>> The server must also be configured with max_wal_senders set high
>> enough to leave at least one session available for the backup.
>
> I think that it's better to explain explicitly here that max_wal_senders
> should be higher than one by default.

Recovery tests are broken by this patch; the backup() method in PostgresNode.pm uses pg_basebackup -x:

sub backup
{
    my ($self, $backup_name) = @_;
    my $backup_path = $self->backup_dir . '/' . $backup_name;
    my $port        = $self->port;
    my $name        = $self->name;

    print "# Taking pg_basebackup $backup_name from node \"$name\"\n";
    TestLib::system_or_bail('pg_basebackup', '-D', $backup_path, '-p', $port,
        '-x', '--no-sync');
    print "# Backup finished\n";
}

-- 
Michael
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
On Wed, Sep 21, 2016 at 12:52 PM, Heikki Linnakangas wrote: > I find this unification business really complicated. I think it'd be simpler > to keep the BufFiles and LogicalTapeSets separate, and instead teach > tuplesort.c how to merge tapes that live on different > LogicalTapeSets/BufFiles. Or refactor LogicalTapeSet so that a single > LogicalTapeSet can contain tapes from different underlying BufFiles. > > What I have in mind is something like the attached patch. It refactors > LogicalTapeRead(), LogicalTapeWrite() etc. functions to take a LogicalTape > as argument, instead of LogicalTapeSet and tape number. LogicalTapeSet > doesn't have the concept of a tape number anymore, it can contain any number > of tapes, and you can create more on the fly. With that, it'd be fairly easy > to make tuplesort.c merge LogicalTapes that came from different tape sets, > backed by different BufFiles. I think that'd avoid much of the unification > code. I just looked at the buffile.c/buffile.h changes in the latest version of the patch and I agree with this criticism, though maybe not with the proposed solution. I actually don't understand what "unification" is supposed to mean. The patch really doesn't explain that anywhere that I can see. It says stuff like: + * Parallel operations can use an interface to unify multiple worker-owned + * BufFiles and a leader-owned BufFile within a leader process. This relies + * on various fd.c conventions about the naming of temporary files. That comment tells you that unification is a thing you can do -- via an unspecified interface for unspecified reasons using unspecified conventions -- but it doesn't tell you what the semantics of it are supposed to be. For example, if we "unify" several BufFiles, do they then have a shared seek pointer? Do the existing contents effectively get concatenated in an unpredictable order, or are they all expected to be empty at the time unification happens? Or something else? 
It's fine to make up new words -- indeed, in some sense that is the essence of writing about any complex problem -- but you have to define them. As far as I can tell, the idea is that we're somehow magically concatenating the BufFiles into one big super-BufFile, but I'm fuzzy on exactly what's supposed to be going on there. It's hard to understand how something like this doesn't leak resources. Maybe that's been thought about here, but it isn't very clear to me how it's supposed to work. In Heikki's proposal, if process A is trying to read a file owned by process B, and process B dies and removes the file before process A gets around to reading it, we have got trouble, especially on Windows, which apparently has low tolerance for such things. Peter's proposal avoids that - I *think* - by making the leader responsible for all resource cleanup, but that's inferior to the design we've used for other sorts of shared resource cleanup (DSM, DSA, shm_mq, lock groups) where the last process to detach always takes responsibility. That avoids assuming that we're always dealing with a leader-follower situation, it doesn't categorically require the leader to be the one who creates the shared resource, and it doesn't require the leader to be the last process to die.

Imagine a data structure that is stored in dynamic shared memory and contains space for a filename, a reference count, and a mutex. Let's call this thing a SharedTemporaryFile or something like that. It offers these APIs:

extern void SharedTemporaryFileInitialize(SharedTemporaryFile *);
extern void SharedTemporaryFileAttach(SharedTemporaryFile *, dsm_segment *seg);
extern void SharedTemporaryFileAssign(SharedTemporaryFile *, char *pathname);
extern File SharedTemporaryFileGetFile(SharedTemporaryFile *);

After setting aside sizeof(SharedTemporaryFile) bytes in your shared DSM segment, you call SharedTemporaryFileInitialize() to initialize them.
Then, every process that cares about the file does SharedTemporaryFileAttach(), which bumps the reference count and sets an on_dsm_detach hook to decrement the reference count and unlink the file if the reference count thereby reaches 0. One of those processes does SharedTemporaryFileAssign(), which fills in the pathname and clears FD_TEMPORARY. Then, any process that has attached can call SharedTemporaryFileGetFile() to get a File which can then be accessed normally. So, the pattern for parallel sort would be:

- Leader sets aside space and calls SharedTemporaryFileInitialize() and SharedTemporaryFileAttach().
- The cooperating worker calls SharedTemporaryFileAttach() and then SharedTemporaryFileAssign().
- The leader then calls SharedTemporaryFileGetFile().

Since the leader can attach to the file before the path name is filled in, there's no window where the file is at risk of being leaked. Before SharedTemporaryFileAssign(), the worker is solely responsible for removing the file; after that call, whichever of the leader and the worker exits last will remove the file.
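The last-to-detach cleanup pattern described above can be sketched in a single process (a Python illustration of the proposed semantics, not the actual PostgreSQL SharedTemporaryFile API): each participant attaches, bumping a reference count, before the filename is assigned, and whoever detaches last unlinks the file.

```python
import os
import tempfile

class SharedTempFile:
    """Toy model of a reference-counted shared temporary file."""

    def __init__(self):
        self.refcount = 0
        self.path = None

    def attach(self):
        # In the real design this also registers an on_dsm_detach hook.
        self.refcount += 1

    def assign(self, path):
        self.path = path            # worker publishes the real filename

    def detach(self):
        self.refcount -= 1
        if self.refcount == 0 and self.path is not None:
            os.unlink(self.path)    # last detacher removes the file


stf = SharedTempFile()
stf.attach()                        # leader attaches before the path exists
stf.attach()                        # worker attaches too
fd, path = tempfile.mkstemp()
os.close(fd)
stf.assign(path)                    # worker fills in the pathname
stf.detach()                        # worker exits first...
still_there = os.path.exists(path)  # ...file survives: leader still attached
stf.detach()                        # leader detaches last; file is unlinked
gone = not os.path.exists(path)
```

Because the leader attached before assign(), there is no window in which a worker crash could leak the file to nobody's responsibility.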
Re: [HACKERS] Clarifying "server starting" messaging in pg_ctl start without --wait
On 12/20/16 3:49 PM, Tom Lane wrote: > Peter Eisentraut writes: >> Maybe the fix is to make --wait the default? > I was wondering about that too ... does anyone remember the rationale > for the current behavior? Probably because that didn't work reliably before pg_ctl learned how to get the right port number and PQping() and such things. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: pg_authid.rolpassword format (was Re: [HACKERS] Password identifiers, protocol aging and SCRAM protocol)
On Wed, Dec 21, 2016 at 1:08 AM, David Fetter wrote: > Would a view that shows only what's to the left of the first semicolon > suit this purpose? Of course it would, you would just need to make the routines now checking the shape of MD5 and SCRAM identifiers available at SQL level and feed the strings into them. Now I am not sure that it's worth having a new superuser view for that. pg_roles and pg_shadow hide the information about verifiers. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Clarifying "server starting" messaging in pg_ctl start without --wait
On Tue, Dec 20, 2016 at 1:49 PM, Tom Lane wrote: > Peter Eisentraut writes: > > Maybe the fix is to make --wait the default? > > I was wondering about that too ... does anyone remember the rationale > for the current behavior? But the message for the non-wait case seems > like it could stand to be improved independently of that. > Not totally independent. If the default is changed to --wait then the message can be written assuming the user understands what "--no-wait" does; but if the default is left "--no-wait" then cluing the user into the asynchronous behavior and telling them how to get the more expected synchronous behavior would be helpful. David J.
Re: [HACKERS] Clarifying "server starting" messaging in pg_ctl start without --wait
Tom Lane wrote:
> Ryan Murphy writes:
> > I'm concerned some new users may not understand this behavior of pg_ctl, so
> > I wanted to suggest that we add some additional messaging after "server
> > starting" - something like:
>
> > $ pg_ctl -D datadir -l logfile start
> > server starting
> > (to wait for confirmation that server actually started, try pg_ctl again
> > with --wait)
>
> That seems annoyingly verbose and nanny-ish. Perhaps we could get the
> point across like this:
>
> $ pg_ctl -D datadir -l logfile start
> requested server to start

+1, but also +1 to making --wait the default. Extra points if systemd start scripts are broken by the change ;-)

-- 
Álvaro Herrera  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Clarifying "server starting" messaging in pg_ctl start without --wait
Peter Eisentraut writes: > Maybe the fix is to make --wait the default? I was wondering about that too ... does anyone remember the rationale for the current behavior? But the message for the non-wait case seems like it could stand to be improved independently of that. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Clarifying "server starting" messaging in pg_ctl start without --wait
Ryan Murphy writes: > I'm concerned some new users may not understand this behavior of pg_ctl, so > I wanted to suggest that we add some additional messaging after "server > starting" - something like: > $ pg_ctl -D datadir -l logfile start > server starting > (to wait for confirmation that server actually started, try pg_ctl again > with --wait) That seems annoyingly verbose and nanny-ish. Perhaps we could get the point across like this: $ pg_ctl -D datadir -l logfile start requested server to start $ regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Clarifying "server starting" messaging in pg_ctl start without --wait
On Tue, Dec 20, 2016 at 03:43:11PM -0500, Peter Eisentraut wrote:
> On 12/20/16 3:31 PM, Ryan Murphy wrote:
> > I'm concerned some new users may not understand this behavior of pg_ctl,
> > so I wanted to suggest that we add some additional messaging after
> > "server starting" - something like:
> >
> > $ pg_ctl -D datadir -l logfile start
> > server starting
> > (to wait for confirmation that server actually started, try pg_ctl again
> > with --wait)
>
> Maybe the fix is to make --wait the default?

+1. It's not super useful to have the prompt return while the server is still starting up.

Best,
David.

-- 
David Fetter http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
Re: [HACKERS] Clarifying "server starting" messaging in pg_ctl start without --wait
On 12/20/16 3:31 PM, Ryan Murphy wrote: > I'm concerned some new users may not understand this behavior of pg_ctl, > so I wanted to suggest that we add some additional messaging after > "server starting" - something like: > > $ pg_ctl -D datadir -l logfile start > server starting > (to wait for confirmation that server actually started, try pg_ctl again > with --wait) Maybe the fix is to make --wait the default? -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Clarifying "server starting" messaging in pg_ctl start without --wait
Hi Postgres Devs, I had a suggestion regarding the output pg_ctl gives when you use it to start the postgres server. At first I was going to write a patch, but then I decided to just ask you guys first to see what you think. I had an issue earlier where I was trying to upgrade my postgres database to a new major version and incidentally a new pg_catalog version, and therefore the new code could no longer run the existing data directory without pg_upgrade or pg_dump (I ended up needing pg_dump). Initially I was very confused because I tried running "pg_ctl -D datadir -l logfile start" like normal, and it just said "server starting", yet the server was not starting. It took me a while to realize that I needed to use the "--wait" / "-w" option to actually wait and test whether the server was really starting, at which point it told me there was a problem and to check the log. I'm concerned some new users may not understand this behavior of pg_ctl, so I wanted to suggest that we add some additional messaging after "server starting" - something like: $ pg_ctl -D datadir -l logfile start server starting (to wait for confirmation that server actually started, try pg_ctl again with --wait) What do you guys think? Is it important to keep pg_ctl output more terse than this? I do think something like this could help new users avoid frustration. I'm happy to write a patch for this if it's helpful, though it's such a simple change that if one of the core devs wants this s/he can probably more easily just add it themselves. Cheers, Ryan
Re: [HACKERS] Rethinking our fulltext phrase-search implementation
I wrote:
> I've been thinking about how to fix the problem Andreas Seltenreich
> reported at
> https://postgr.es/m/87eg1y2s3x@credativ.de

Attached is a proposed patch that deals with the problems discussed here and in <26706.1482087...@sss.pgh.pa.us>. Is anyone interested in reviewing this, or should I just push it?

BTW, I noticed that ts_headline() seems to not behave all that nicely for phrase searches, eg

regression=# SELECT ts_headline('simple', '1 2 3 1 3'::text, '2 <-> 3', 'ShortWord=0');
       ts_headline
-------------------------
 1 2 <b>3</b> 1 <b>3</b>
(1 row)

Highlighting the second "3", which is not a match, seems pretty dubious. Negative items are even worse, they don't change the results at all:

regression=# SELECT ts_headline('simple', '1 2 3 1 3'::text, '!2 <-> 3', 'ShortWord=0');
       ts_headline
-------------------------
 1 2 <b>3</b> 1 <b>3</b>
(1 row)

However, the code involved seems unrelated to the present patch, and it's also about as close to completely uncommented as I've seen anywhere in the PG code base. So I'm not excited about touching it.

regards, tom lane

diff --git a/doc/src/sgml/datatype.sgml b/doc/src/sgml/datatype.sgml
index 67d0c34..464ce83 100644
*** a/doc/src/sgml/datatype.sgml
--- b/doc/src/sgml/datatype.sgml
*** SELECT 'fat & rat & ! cat'::tsqu
*** 3959,3973 ****
  tsquery
  'fat' & 'rat' & !'cat'
- 
- SELECT '(fat | rat) <-> cat'::tsquery;
- tsquery
- ---
- 'fat' <-> 'cat' | 'rat' <-> 'cat'
- 
- The last example demonstrates that tsquery sometimes
- rearranges nested operators into a logically equivalent formulation.
--- 3959,3965 ----
diff --git a/doc/src/sgml/textsearch.sgml b/doc/src/sgml/textsearch.sgml
index 2da7595..bc33a70 100644
*** a/doc/src/sgml/textsearch.sgml
--- b/doc/src/sgml/textsearch.sgml
*** SELECT 'fat & cow'::tsquery @@ 'a fa
*** 264,270 ****
  text, any more than a tsvector is. A tsquery contains search terms,
  which must be already-normalized lexemes, and may combine multiple terms
  using AND, OR, NOT, and FOLLOWED BY operators.
! (For details see .)
  There are functions to_tsquery, plainto_tsquery, and
  phraseto_tsquery that are helpful in converting user-written text into a proper
--- 264,270 ----
  text, any more than a tsvector is. A tsquery contains search terms,
  which must be already-normalized lexemes, and may combine multiple terms
  using AND, OR, NOT, and FOLLOWED BY operators.
! (For syntax details see .)
  There are functions to_tsquery, plainto_tsquery, and
  phraseto_tsquery that are helpful in converting user-written text into a proper
*** text @@ text
*** 323,328 ****
--- 323,330 ----
  at least one of its arguments must appear, while the ! (NOT)
  operator specifies that its argument must not appear in
  order to have a match.
+ For example, the query fat & ! rat matches documents that
+ contain fat but not rat.
*** SELECT phraseto_tsquery('the cats ate th
*** 377,382 ****
--- 379,401 ----
  then &, then <->, and ! most tightly.
+ 
+ 
+ It's worth noticing that the AND/OR/NOT operators mean something subtly
+ different when they are within the arguments of a FOLLOWED BY operator
+ than when they are not, because then the position of the match is
+ significant. Normally, !x matches only documents that do not
+ contain x anywhere. But x <-> !y
+ matches x if it is not immediately followed by y;
+ an occurrence of y elsewhere in the document does not prevent
+ a match. Another example is that x & y normally only
+ requires that x and y both appear somewhere in the
+ document, but (x & y) <-> z requires x
+ and y to match at the same place, immediately before
+ a z. Thus this query behaves differently from x
+ <-> z & y <-> z, which would match a document
+ containing two separate sequences x z and y z.
+ 
diff --git a/src/backend/utils/adt/tsginidx.c b/src/backend/utils/adt/tsginidx.c
index efc111e..3e0a444 100644
*** a/src/backend/utils/adt/tsginidx.c
--- b/src/backend/utils/adt/tsginidx.c
*** checkcondition_gin(void *checkval, Query
*** 212,218 ****
   * Evaluate tsquery boolean expression using ternary logic.
   */
  static GinTernaryValue
! TS_execute_ternary(GinChkVal *gcv, QueryItem *curitem)
  {
      GinTernaryValue val1,
                  val2,
--- 212,218 ----
   * Evaluate tsquery boolean expression using ternary logic.
   */
  static GinTernaryValue
! TS_execute_ternary(GinChkVal *gcv, QueryItem *curitem, bool in_phrase)
  {
      GinTernaryValue val1,
                  val2,
*** TS_execute_ternary(GinChkVal *gcv, Query
*** 230,236 ****
      switch (curitem->qoperator.oper)
      {
          case OP_NOT:
!             result
Re: [HACKERS] Protect syscache from bloating with negative cache entries
Robert Haas writes: > On Tue, Dec 20, 2016 at 10:09 AM, Tom Lane wrote: >> I don't understand why we'd make that a system-wide behavior at all, >> rather than expecting each process to manage its own cache. > Individual backends don't have a really great way to do time-based > stuff, do they? I mean, yes, there is enable_timeout() and friends, > but I think that requires quite a bit of bookkeeping. If I thought that "every ten minutes" was an ideal way to manage this, I might worry about that, but it doesn't really sound promising at all. Every so many queries would likely work better, or better yet make it self-adaptive depending on how much is in the local syscache. The bigger picture here though is that we used to have limits on syscache size, and we got rid of them (commit 8b9bc234a, see also https://www.postgresql.org/message-id/flat/5141.1150327541%40sss.pgh.pa.us) not only because of the problem you mentioned about performance falling off a cliff once the working-set size exceeded the arbitrary limit, but also because enforcing the limit added significant overhead --- and did so whether or not you got any benefit from it, ie even if the limit is never reached. Maybe the present patch avoids imposing a pile of overhead in situations where no pruning is needed, but it doesn't really look very promising from that angle in a quick once-over. BTW, I don't see the point of the second patch at all? Surely, if an object is deleted or updated, we already have code that flushes related catcache entries. Otherwise the caches would deliver wrong data. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
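Tom's point about limit-enforcement overhead can be made concrete with a toy sketch (deliberately not PostgreSQL's actual catcache code; Python is used only for illustration): a size-bounded LRU cache pays its recency bookkeeping on every single hit, whether or not the limit is ever reached.

```python
from collections import OrderedDict

class BoundedCache:
    """Toy size-limited cache with LRU eviction.

    Illustrates the trade-off discussed above: the recency bookkeeping
    (move_to_end) is paid on every hit, even if the limit is never
    reached -- the kind of always-on overhead that motivated removing
    the hard syscache limit in commit 8b9bc234a.
    """
    def __init__(self, limit):
        self.limit = limit
        self.entries = OrderedDict()

    def lookup(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)   # LRU maintenance on every hit
            return self.entries[key]
        return None

    def insert(self, key, value):
        self.entries[key] = value
        self.entries.move_to_end(key)
        while len(self.entries) > self.limit:
            self.entries.popitem(last=False)  # evict least recently used

cache = BoundedCache(limit=2)
cache.insert('a', 1)
cache.insert('b', 2)
cache.lookup('a')        # 'a' becomes most recently used
cache.insert('c', 3)     # evicts 'b', the least recently used
assert cache.lookup('b') is None
assert cache.lookup('a') == 1
```

A self-adaptive scheme, as suggested above, would instead prune based on how much actually sits in the cache, paying nothing on the common lookup path.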
Re: [HACKERS] Cache Hash Index meta page.
On Tue, Dec 20, 2016 at 3:51 AM, Robert Haas wrote:
> On Fri, Dec 16, 2016 at 5:16 AM, Mithun Cy wrote:
>>> Shouldn't _hash_doinsert() be using the cache, too?
>>
>> Yes, we have an opportunity there, added same in code. But one difference
>> is that at the end, at least once, we need to read the meta page to split
>> and/or write. Performance improvement might not be as much as read-only.
>
> Why would you need to read it at least once in one case but not the other?

For read-only: in _hash_first, if the target bucket is not split after caching the meta page contents, we never need to read the meta page. But for insert: in _hash_doinsert, at the end we modify the meta page to store the number of tuples.

+ * Write-lock the metapage so we can increment the tuple count. After
+ * incrementing it, check to see if it's time for a split.
+ */
+ _hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
+
+ metap->hashm_ntuples += 1;

> Well, I guess that makes it more appealing to cache the whole page at least
> in terms of raw number of bytes, but I suppose my real complaint here is
> that there don't seem to be any clear rules for where, whether, and to what
> extent we can rely on the cache to be valid.

Okay, will revert back to caching the entire meta page.

> Without that, I think this patch is creating an extremely serious
> maintenance hazard for anyone who wants to try to modify this code in the
> future. A future patch author needs to be able to tell what data they can
> use and what data they can't use, and why.

If it is okay, I can document for each member of HashMetaPageData whether it can be read from the cached meta page or must be read directly from the current meta page. Below I have briefly commented on each member. If you agree with this approach, I will produce a neat patch for it.

typedef struct HashMetaPageData
{

1. uint32 hashm_magic; /* magic no. for hash tables */
-- This should remain the same, so it can be read from the cached meta page.

2. uint32 hashm_version; /* version ID */
-- This is initialized once and never changed afterwards, so it can be read from the cached meta page.

3. double hashm_ntuples; /* number of tuples stored in the table */
-- This changes on every insert, so it should not be read from cached data.

4. uint16 hashm_ffactor; /* target fill factor (tuples/bucket) */
-- Initialized once, never changed afterwards; can be read from the cached meta page.

5. uint16 hashm_bsize; /* index page size (bytes) */
-- Initialized once, never changed afterwards; can be read from the cached meta page.

6. uint16 hashm_bmsize; /* bitmap array size (bytes) - must be a power of 2 */
-- Initialized once, never changed afterwards; can be read from the cached meta page.

7. uint16 hashm_bmshift; /* log2(bitmap array size in BITS) */
-- Initialized once, never changed afterwards; can be read from the cached meta page.

8. uint32 hashm_maxbucket; /* ID of maximum bucket in use */
   uint32 hashm_highmask; /* mask to modulo into entire table */
   uint32 hashm_lowmask; /* mask to modulo into lower half of table */
-- If hashm_maxbucket, hashm_highmask and hashm_lowmask are all read and cached at the same time while the meta page is locked, then the key-to-bucket-number mapping function _hash_hashkey2bucket() will always produce the same output. If a bucket is split after caching these elements (which can be detected because old bucket pages never move once allocated, and we mark bucket pages' hasho_prevblkno with the incremented hashm_highmask), we can invalidate them and re-read them from the meta page. If the intention is not to save a meta page read while mapping a key to a bucket page, then do not read them from the cached meta page.

9. uint32 hashm_ovflpoint; /* splitpoint from which ovflpgs being allocated */
-- Since this is used for allocation of overflow pages, the latest value should be read directly from the meta page.

10. uint32 hashm_firstfree; /* lowest-number free ovflpage (bit#) */
-- Should always be read from the meta page directly.

11. uint32 hashm_nmaps; /* number of bitmap pages */
-- Should always be read from the meta page directly.

12. RegProcedure hashm_procid; /* hash procedure id from pg_proc */
-- Never used so far; when we do use it, its caching policy will need to be decided.

13. uint32 hashm_spares[HASH_MAX_SPLITPOINTS]; /* spare pages before each splitpoint */
-- Spares before a given splitpoint never change, and bucket pages never move. So when combined with cached hashm_maxbucket, hashm_highmask and hashm_lowmask, all read at the same time under lock, BUCKET_TO_BLKNO will always produce the same output, pointing to the right physical block. Should only be used to save a meta page read when we want to map a key to a bucket block, as said above.

14. BlockNumber hashm_mapp[HASH_MAX_BITMAPS]; /* blknos of ovfl bitmaps */
-- Always read from the meta page directly.

} HashMetaPageData;

> Any existing bugs should be t
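Point 8 above can be illustrated with a small sketch of the key-to-bucket mapping performed by _hash_hashkey2bucket() (Python used only for illustration): the result is self-consistent as long as maxbucket, highmask and lowmask come from the same snapshot of the meta page.

```python
def hashkey_to_bucket(hashkey, maxbucket, highmask, lowmask):
    """Sketch of _hash_hashkey2bucket(): map a hash value to a bucket
    number from a snapshot of the metapage fields.  If all three values
    were read together under the metapage lock, the answer is
    internally consistent even if the live metapage has since advanced."""
    bucket = hashkey & highmask
    if bucket > maxbucket:
        # The masked value points past the last bucket that exists yet;
        # wrap into the lower half of the table, which must exist.
        bucket = bucket & lowmask
    return bucket

# With maxbucket = 5, highmask = 7, lowmask = 3:
assert hashkey_to_bucket(4, 5, 7, 3) == 4   # within the live range
assert hashkey_to_bucket(6, 5, 7, 3) == 2   # 6 > 5, so wrap with lowmask
```

The split-detection idea in point 8 then amounts to: if the visited bucket page records a newer mask in hasho_prevblkno than the cached snapshot, discard the snapshot and re-read the meta page.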
Re: [HACKERS] Migration Docs WAS: Proposal for changes to recovery.conf API
On 12/19/2016 01:29 PM, Peter Eisentraut wrote: > On 12/16/16 8:52 PM, Robert Haas wrote: >> > If the explanation is just a few sentences long, I see no reason not >> > to include it in the release notes. > As far as I can tell from the latest posted patch, the upgrade > instructions are > > - To trigger recovery, create an empty file recovery.trigger instead of > recovery.conf. > > - All parameters formerly in recovery.conf are now regular > postgresql.conf parameters. For backward compatibility, recovery.conf > is read after postgresql.conf, but the parameters can now be put into > postgresql.conf if desired. Aren't we changing how some of the parameters work? > > Some of that might be subject to patch review, but it's probably not > going to be much longer than that. So I think that will fit well into > the usual release notes section. Changed the subject line of this thread because people are becoming confused about the new topic. I'm not talking about *just* the recovery.conf changes. We're making a lot of changes to 10 which will require user action, and there may be more before 10 is baked. For example, dealing with the version numbering change. I started a list of the things we're breaking for 10, but I don't have it with me at the moment. There's more than 3 things on it. And then there's docs for stuff which isn't *required* by upgrading, but would be a good idea. For example, we'll eventually want a doc on how to migrate old-style partitioned tables to new-style partitioned tables. In any case, Peter's response shows *exactly* why I don't want to put this documentation into the release notes: because people are going to complain it's too long and want to truncate it. Writing the docs will be hard enough; if I (or anyone else) has to argue about whether or not they're too long, I'm just going to drop the patch and walk away. 
-- Josh Berkus Red Hat OSAS (any opinions are my own)
Re: [HACKERS] [COMMITTERS] pgsql: Implement table partitioning.
Robert Haas wrote: > Implement table partitioning. I thought it was odd to use rd_rel->reloftype as a boolean in ATExecAttachPartition, but apparently we do it elsewhere too, so let's leave that complaint for another day. What I also found off in the same function is that we use SearchSysCacheCopyAttName() on each attribute and then don't free the result, and don't ever use the returned tuple either. A simple fix, I thought: just remove the "Copy" and add a ReleaseSysCache(). But then I noticed this whole thing is rather strange -- why not pass a boolean flag down to CreateInheritance() and from there to MergeAttributesIntoExisting() to implement the check? That seems less duplicative. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] pgstattuple documentation clarification
On 12/20/2016 10:01 AM, Tom Lane wrote: Andrew Dunstan writes: Recently a client was confused because there was a substantial difference between the reported table_len of a table and the sum of the corresponding tuple_len, dead_tuple_len and free_space. The docs are fairly silent on this point, and I agree that in the absence of explanation it is confusing, so I propose that we add a clarification note along the lines of: The table_len will always be greater than the sum of the tuple_len, dead_tuple_len and free_space. The difference is accounted for by page overhead and space that is not free but cannot be attributed to any particular tuple. Or perhaps we should be more explicit and refer to the item pointers on the page. I find "not free but cannot be attributed to any particular tuple" to be entirely useless weasel wording, not to mention wrong with respect to item pointers in particular. Well, the reason I put it like that was that in my experimentation, after I vacuumed the table after a large delete the item pointer table didn't seem to shrink (at least according to the pgstattuple output), so we had a page with 0 dead tuples but some non-live line pointer space. If that's not what's happening then something is going on that I don't understand. (Wouldn't be a first.) Perhaps we should start counting the item pointers in tuple_len. We'd still have to explain about page header overhead, but that would be a pretty small and fixed-size discrepancy. Sure, sounds like a good idea. Meanwhile it would be nice to explain to people exactly what we currently have. If you have a good formulation I'm all ears. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
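For concreteness, the gap being discussed can be estimated with a back-of-the-envelope sketch (Python, illustrative only): the space that pgstattuple reports as none of tuple_len, dead_tuple_len or free_space is essentially the fixed page header plus one line pointer per item. The 24-byte header and 4-byte line pointer sizes are the usual values on common builds; alignment padding of tuples adds a further, variable amount not modeled here.

```python
PAGE_HEADER_SIZE = 24  # sizeof(PageHeaderData) on common builds
ITEMID_SIZE = 4        # sizeof(ItemIdData), one per line pointer

def unaccounted_space(n_pages, n_line_pointers):
    """Bytes contained in table_len but reported in none of tuple_len,
    dead_tuple_len or free_space (ignoring tuple alignment padding):
    per-page headers plus the line pointer array."""
    return n_pages * PAGE_HEADER_SIZE + n_line_pointers * ITEMID_SIZE

# A 1000-page table whose pages hold 50000 line pointers in total leaves
# roughly a 224 kB discrepancy from the page overhead alone.
assert unaccounted_space(1000, 50000) == 224000
```

This also matches the observation above that line pointers left behind by a vacuumed-away delete keep the discrepancy from shrinking.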
Re: [HACKERS] Protect syscache from bloating with negative cache entries
On Tue, Dec 20, 2016 at 10:09 AM, Tom Lane wrote: > Craig Ringer writes: >> On 20 December 2016 at 21:59, Robert Haas wrote: >>> We could implement this by having >>> some process, like the background writer, >>> SendProcSignal(PROCSIG_HOUSEKEEPING) to every process in the system >>> every 10 minutes or so. > >> ... on a rolling basis. > > I don't understand why we'd make that a system-wide behavior at all, > rather than expecting each process to manage its own cache. Individual backends don't have a really great way to do time-based stuff, do they? I mean, yes, there is enable_timeout() and friends, but I think that requires quite a bit of bookkeeping. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Declarative partitioning - another take
On Mon, Dec 19, 2016 at 10:59 PM, Robert Haas wrote: > On Sun, Dec 18, 2016 at 10:00 PM, Amit Langote > wrote: >> Here are updated patches including the additional information. > > Thanks. Committed 0001. Will have to review the others when I'm less tired. 0002. Can you add a test case for the bug fixed by this patch? 0003. Loses equalTupleDescs() check and various cases where ExecOpenIndexes can be skipped. Isn't that bad? Also, "We locked all the partitions above including the leaf partitions" should say "Caller must have locked all the partitions including the leaf partitions". 0004. Unnecessary whitespace change in executor.h. Still don't understand why we need to hoist RelationGetPartitionQual() into the caller. 0005. Can you add a test case for the bug fixed by this patch? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
[HACKERS] Replication slot xmin is not reset if HS feedback is turned off while standby is shut down
I just had a client issue with table bloat that I traced back to a stale xmin value in a replication slot. The xmin value from hot standby feedback is stored in the replication slot and used for the vacuum xmin calculation. If hot standby feedback is turned off while the walreceiver is active, the xmin gets reset by a HS feedback message containing InvalidTransactionId. However, if feedback gets turned off while the standby is shut down, this message never gets sent and a stale value is left behind in the replication slot, holding back vacuum.

The simple fix seems to be to always send out at least one feedback message on each connect, regardless of the hot_standby_feedback setting. Patch attached. It looks like this goes back to version 9.4. It could conceivably cause issues for replication middleware that does not know how to handle hot standby feedback messages; I am not sure whether any such middleware exists or whether that is a concern.

A shell script to reproduce the problem is also attached; adjust the PGPATH variable to your postgres install and run it in an empty directory.

Regards, Ants Aasma

diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index cc3cf7d..31333ec 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -1157,7 +1157,9 @@ XLogWalRcvSendReply(bool force, bool requestReply)
  * in case they don't have a watch.
  *
  * If the user disables feedback, send one final message to tell sender
- * to forget about the xmin on this standby.
+ * to forget about the xmin on this standby. We also send this message
+ * on first connect because a previous connection might have set xmin
+ * on a replication slot.
  */
 static void
 XLogWalRcvSendHSFeedback(bool immed)
@@ -1167,7 +1169,7 @@ XLogWalRcvSendHSFeedback(bool immed)
 	uint32		nextEpoch;
 	TransactionId xmin;
 	static TimestampTz sendTime = 0;
-	static bool master_has_standby_xmin = false;
+	static bool master_has_standby_xmin = true;

 	/*
 	 * If the user doesn't want status to be reported to the master, be sure
@@ -1192,20 +1194,13 @@ XLogWalRcvSendHSFeedback(bool immed)
 	}

 	/*
-	 * If Hot Standby is not yet active there is nothing to send. Check this
-	 * after the interval has expired to reduce number of calls.
-	 */
-	if (!HotStandbyActive())
-	{
-		Assert(!master_has_standby_xmin);
-		return;
-	}
-
-	/*
 	 * Make the expensive call to get the oldest xmin once we are certain
 	 * everything else has been checked.
+	 *
+	 * If Hot Standby is not yet active we reset the xmin value. Check this
+	 * after the interval has expired to reduce number of calls.
 	 */
-	if (hot_standby_feedback)
+	if (hot_standby_feedback && HotStandbyActive())
 		xmin = GetOldestXmin(NULL, false);
 	else
 		xmin = InvalidTransactionId;

slot-xmin-not-reset-reproduce.sh
Description: Bourne shell script
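To see why flipping the initializer to true is enough, here is a toy model of the decision logic (Python, illustrative only; the real code is XLogWalRcvSendHSFeedback above): the static flag records whether the primary may still hold an xmin on our behalf, and starting it as true forces exactly one reset message after every (re)connect.

```python
INVALID_XID = 0  # stand-in for InvalidTransactionId

class Walreceiver:
    """Toy model of the xmin-reset logic in XLogWalRcvSendHSFeedback().

    With master_has_standby_xmin initialized to True (the patch's
    change), at least one message goes out after a (re)connect, so a
    stale slot xmin left by a previous connection is cleared even when
    hot_standby_feedback is now off."""
    def __init__(self):
        self.master_has_standby_xmin = True   # patched initial value

    def send_feedback(self, hot_standby_feedback, hot_standby_active, oldest_xmin):
        if hot_standby_feedback and hot_standby_active:
            xmin = oldest_xmin
        else:
            xmin = INVALID_XID
        # Skip the message only when there is nothing to reset upstream.
        if xmin == INVALID_XID and not self.master_has_standby_xmin:
            return None
        self.master_has_standby_xmin = (xmin != INVALID_XID)
        return xmin   # the xmin carried by the feedback message

rcv = Walreceiver()
# Feedback disabled after a restart: one reset message still goes out...
assert rcv.send_feedback(False, False, 1234) == INVALID_XID
# ...and subsequent cycles stay quiet.
assert rcv.send_feedback(False, False, 1234) is None
```

This also shows the cost Craig objects to below: gating the xmin on HotStandbyActive() would keep sending InvalidTransactionId until hot standby connections are accepted.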
Re: [HACKERS] Declarative partitioning - another take
Robert Haas wrote: > On Tue, Dec 20, 2016 at 10:27 AM, Alvaro Herrera > wrote: > > Even if we decide to keep the message, I think it's not very good > > wording anyhow; as a translator I disliked it on sight. Instead of > > "skipping scan to validate" I would use "skipping validation scan", > > except that it's not clear what it is we're validating. Mentioning > > the partition constraint in errcontext() doesn't seem like a great solution, but > > I can't think of anything better. > > Maybe something like: partition constraint for table \"%s\" is implied > by existing constraints Actually, shouldn't we emit a message if we *don't* skip the check? -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] pg_background contrib module proposal
On Tue, Dec 20, 2016 at 11:11:36AM +0530, amul sul wrote: > On Tue, Dec 20, 2016 at 12:21 AM, David Fetter wrote: > > On Thu, Nov 24, 2016 at 09:16:53AM +0530, amul sul wrote: > >> Hi All, > >> > >> I would like to take over pg_background patch and repost for > >> discussion and review. > > > > This looks great. > > > > Sadly, it now breaks initdb when applied atop > > dd728826c538f000220af98de025c00114ad0631 with: > > > > performing post-bootstrap initialization ... TRAP: > > FailedAssertion("!(const Node*)(rinfo))->type) == T_RestrictInfo))", > > File: "costsize.c", Line: 660) > > sh: line 1: 2840 Aborted (core dumped) > > "/home/shackle/pggit/postgresql/tmp_install/home/shackle/10/bin/postgres" > > --single -F -O -j -c search_path=pg_catalog -c exit_on_error=true template1 > > > /dev/null > > > I've compiled the binary with the --enable-cassert flag and the pg_background > patch applied on top of the described commit, as well as on latest HEAD, but I > am not able to reproduce this crash, see this: > > [VM postgresql]$ /home/amul/Public/pg_inst/pg-master/bin/postgres > --single -F -O -j -c search_path=pg_catalog -c exit_on_error=true > template1 I haven't managed to reproduce it either. Sorry about the noise. Best, David. -- David Fetter http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david(dot)fetter(at)gmail(dot)com Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
Re: pg_authid.rolpassword format (was Re: [HACKERS] Password identifiers, protocol aging and SCRAM protocol)
On Tue, Dec 20, 2016 at 08:34:19AM -0500, Stephen Frost wrote: > Heikki, > > * Heikki Linnakangas (hlinn...@iki.fi) wrote: > > Even if you have a separate "verifier type" column, it's not fully > > normalized, because there's still a dependency between the > > verifier and verifier type columns. You will always need to look > > at the verifier type to make sense of the verifier itself. > > That's true- but you don't need to look at the verifier, or even > have *access* to the verifier, to look at the verifier type. Would a view that shows only what's to the left of the first semicolon suit this purpose? Best, David. -- David Fetter http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david(dot)fetter(at)gmail(dot)com Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
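David's suggestion could be sketched as follows (Python, illustrative only). The semicolon-separated layout is purely hypothetical here, since the actual rolpassword format is exactly what this thread is debating; the point is only that the type prefix is readable without exposing the verifier payload.

```python
def verifier_type(stored_verifier):
    """Return everything left of the first semicolon, per the suggestion
    above -- a hypothetical 'type;payload' layout, not pg_authid's
    actual format."""
    return stored_verifier.split(';', 1)[0]

# A view exposing only verifier_type(rolpassword) would reveal the
# mechanism but not the secret material after the separator.
assert verifier_type('scram-sha-256;<salt>;<stored-key>') == 'scram-sha-256'
```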
Re: [HACKERS] Declarative partitioning - another take
On Tue, Dec 20, 2016 at 10:27 AM, Alvaro Herrera wrote: > Even if we decide to keep the message, I think it's not very good > wording anyhow; as a translator I disliked it on sight. Instead of > "skipping scan to validate" I would use "skipping validation scan", > except that it's not clear what it is we're validating. Mentioning > the partition constraint in errcontext() doesn't seem like a great solution, but > I can't think of anything better. Maybe something like: partition constraint for table \"%s\" is implied by existing constraints > (We have the table_rewrite event trigger, for a very similar use case.) Hmm, maybe we should see if we can safely support an event trigger here, too. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] pageinspect: Hash index support
Hi Jesper, > I was planning to submit a new version next week for CF/January, so I'll > review your changes with the previous feedback in mind. > > Thanks for working on this ! As I had not seen any updates from you for the last month, I thought of working on it myself. I have created a commitfest entry for it and have added your name as the main author. I am sorry; I think I should have asked for your opinion before starting to work on it.
Re: [HACKERS] Declarative partitioning - another take
Robert Haas wrote:
> On Tue, Dec 20, 2016 at 4:51 AM, Alvaro Herrera wrote:
> > Amit Langote wrote:
> >
> >> diff --git a/src/backend/commands/tablecmds.c
> >> b/src/backend/commands/tablecmds.c
> >> index 1c219b03dd..6a179596ce 100644
> >> --- a/src/backend/commands/tablecmds.c
> >> +++ b/src/backend/commands/tablecmds.c
> >> @@ -13297,8 +13297,10 @@ ATExecAttachPartition(List **wqueue, Relation
> >> rel, PartitionCmd *cmd)
> >> }
> >> }
> >>
> >> + /* It's safe to skip the validation scan after all */
> >> if (skip_validate)
> >> - elog(NOTICE, "skipping scan to validate partition
> >> constraint");
> >> + ereport(INFO,
> >> + (errmsg("skipping scan to validate partition
> >> constraint")));
> >
> > Why not just remove the message altogether?
>
> That's certainly an option. It might be noise in some situations. On
> the other hand, it affects whether attaching the partition is O(1) or
> O(n), so somebody might well want to know. Or maybe they might be
> more likely to want a message in the reverse situation, telling them
> that the partition constraint DOES need to be validated. I'm not sure
> what the best user interface is here; thoughts welcome.

Even if we decide to keep the message, I think it's not very good wording anyhow; as a translator I disliked it on sight. Instead of "skipping scan to validate" I would use "skipping validation scan", except that it's not clear what it is we're validating. Mentioning the partition constraint in errcontext() doesn't seem like a great solution, but I can't think of anything better.

(We have the table_rewrite event trigger, for a very similar use case.)

-- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Proposal for changes to recovery.conf API
On 20 December 2016 at 15:03, Fujii Masao wrote: > API for crash recovery will never be changed. That is, crash recovery needs > neither recovery.trigger nor standby.trigger. When the server starts a crash > recovery without any trigger file, any recovery parameter settings in > postgresql.conf are ignored. Right? Yes. There are no conceptual changes, just the API. The goals are: visibility and reloading of recovery parameters, removal of special case code. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] pgstattuple documentation clarification
Andrew Dunstan writes: > Recently a client was confused because there was a substantial > difference between the reported table_len of a table and the sum of the > corresponding tuple_len, dead_tuple_len and free_space. The docs are > fairly silent on this point, and I agree that in the absence of > explanation it is confusing, so I propose that we add a clarification > note along the lines of: > The table_len will always be greater than the sum of the tuple_len, > dead_tuple_len and free_space. The difference is accounted for by > page overhead and space that is not free but cannot be attributed to > any particular tuple. > Or perhaps we should be more explicit and refer to the item pointers on > the page. I find "not free but cannot be attributed to any particular tuple" to be entirely useless weasel wording, not to mention wrong with respect to item pointers in particular. Perhaps we should start counting the item pointers in tuple_len. We'd still have to explain about page header overhead, but that would be a pretty small and fixed-size discrepancy. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Time to drop old-style (V0) functions?
Robert Haas writes: > Well, I'm hoping there is a way to have a fast-path and a slow-path > without slowing things down too much. Seems like this discussion has veered way off into the weeds. I suggest we confine it to the stated topic; if you want to discuss ways to optimize V1 protocol (or invent a V2, which some of this sounds like it would need), that should be a distinct thread. regards, tom lane
Re: [HACKERS] Protect syscache from bloating with negative cache entries
Craig Ringer writes: > On 20 December 2016 at 21:59, Robert Haas wrote: >> We could implement this by having >> some process, like the background writer, >> SendProcSignal(PROCSIG_HOUSEKEEPING) to every process in the system >> every 10 minutes or so. > ... on a rolling basis. I don't understand why we'd make that a system-wide behavior at all, rather than expecting each process to manage its own cache. regards, tom lane
Re: [HACKERS] pageinspect: Hash index support
Hi, On 12/20/2016 05:55 AM, Ashutosh Sharma wrote: Attached is the revised patch. Please have a look and let me know your feedback. I have also changed the status for this patch in the upcoming commitfest to "needs review". Thanks. I was planning to submit a new version next week for CF/January, so I'll review your changes with the previous feedback in mind. Thanks for working on this! Best regards, Jesper
Re: [HACKERS] Proposal for changes to recovery.conf API
On Tue, Dec 20, 2016 at 7:11 AM, Simon Riggs wrote: > On 19 December 2016 at 21:29, Peter Eisentraut > wrote: >> On 12/16/16 8:52 PM, Robert Haas wrote: >>> If the explanation is just a few sentences long, I see no reason not >>> to include it in the release notes. >> >> As far as I can tell from the latest posted patch, the upgrade >> instructions are >> >> - To trigger recovery, create an empty file recovery.trigger instead of >> recovery.conf. >> >> - All parameters formerly in recovery.conf are now regular >> postgresql.conf parameters. For backward compatibility, recovery.conf >> is read after postgresql.conf, but the parameters can now be put into >> postgresql.conf if desired. >> >> Some of that might be subject to patch review, but it's probably not >> going to be much longer than that. So I think that will fit well into >> the usual release notes section. > > Although that's right, there is more detail. > > The current list is this... based upon discussion in Tokyo dev meeting > > * Any use of the earlier recovery.conf or any of the old recovery > parameter names causes an ERROR, so we have a clear API break > > * To trigger recovery, create an empty recovery.trigger file. Same as > recovery.conf with standby_mode = off > > * To trigger standby, create an empty standby.trigger file. Same as > recovery.conf with standby_mode = on API for crash recovery will never be changed. That is, crash recovery needs neither recovery.trigger nor standby.trigger. When the server starts a crash recovery without any trigger file, any recovery parameter settings in postgresql.conf are ignored. Right? Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Quorum commit for multiple synchronous replication.
On Tue, Dec 20, 2016 at 2:46 PM, Michael Paquier wrote: > On Tue, Dec 20, 2016 at 2:31 PM, Masahiko Sawada > wrote: >> Do we need to consider the sorting method and the selecting k-th >> latest LSN method? > > Honestly, nah. Tests are showing that there are many more bottlenecks > before that with just memory allocation and parsing. I think it's worth prototyping an alternative algorithm and measuring the performance of both the alternative and the current algorithms. This measurement should check not only the bottleneck but also how much each algorithm increases the time that backends must wait before they receive the ack from the walsender. If the current algorithm turns out to be efficient enough, we can just leave the code as it is. OTOH, if not, let's adopt the alternative one. Regards, -- Fujii Masao
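If an alternative to a full sort is prototyped, one natural candidate is a partial selection of the k-th newest LSN. A sketch (Python, illustrative only; real LSNs are 64-bit positions, represented here as integers) of what such a prototype would compute:

```python
import heapq

def quorum_commit_lsn(standby_lsns, k):
    """Return the k-th newest flush LSN among the standbys: with quorum
    commit, everything up to and including this LSN has been confirmed
    by at least k standbys.  heapq.nlargest is O(n log k), which beats
    a full O(n log n) sort when k is small relative to n."""
    return heapq.nlargest(k, standby_lsns)[-1]

lsns = [0x30, 0x10, 0x50, 0x20, 0x40]
assert quorum_commit_lsn(lsns, 1) == 0x50   # any-1: the fastest standby
assert quorum_commit_lsn(lsns, 3) == 0x30   # 3 standbys have flushed >= 0x30
```

Whether the asymptotic saving matters in practice is exactly what the measurement proposed above would show, given that allocation and parsing dominate at realistic standby counts.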
Re: [HACKERS] Protect syscache from bloating with negative cache entries
On 20 December 2016 at 21:59, Robert Haas wrote: > We could implement this by having > some process, like the background writer, > SendProcSignal(PROCSIG_HOUSEKEEPING) to every process in the system > every 10 minutes or so. ... on a rolling basis. Otherwise that'll be no fun at all, especially with some of those lovely "we kept getting errors so we raised max_connections to 5000" systems out there. But also on more sensibly configured ones that're busy and want nice smooth performance without stalls. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
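A rolling dispatch like the one Craig describes could be sketched as follows (Python, illustrative only; process index slices stand in for SendProcSignal targets): each tick signals only one slice, so every backend is covered once per full period without the stall of signalling them all at once.

```python
def rolling_slices(nprocs, nticks):
    """Partition process indexes 0..nprocs-1 into nticks round-robin
    slices, one slice signalled per tick.  Over a full period every
    process is hit exactly once, and each tick touches at most
    ceil(nprocs / nticks) processes."""
    return [list(range(start, nprocs, nticks)) for start in range(nticks)]

# 10 backends, housekeeping period split over 4 ticks:
slices = rolling_slices(10, 4)
# Every process appears exactly once per period...
assert sorted(x for s in slices for x in s) == list(range(10))
# ...and no single tick signals more than 3 of the 10 processes.
assert max(len(s) for s in slices) == 3
```

On a "max_connections = 5000" system this is the difference between 5000 simultaneous signal deliveries and a steady trickle.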
Re: [HACKERS] Hang in pldebugger after git commit : 98a64d0
On Sat, Dec 17, 2016 at 5:46 AM, Amit Kapila wrote:
> Yeah, but we are planning to add a generic flag in WaitEvent structure
> which can be used for similar purpose. However, as per your
> suggestion, I have changed it only for Win32 port. Also, I kept the
> change in ModifyWaitEvent API as that seems to be okay considering
> 'reset' is a generic flag in WaitEvent structure.

Well, we don't really need the change in ModifyWaitEvent if we have the change in WaitEventSetWaitBlock, right? I'd be inclined to ditch the former and keep the latter.

Also, this doesn't look right:

+   for (cur_event = set->events;
+        cur_event < (set->events + set->nevents)
+        && returned_events < nevents;
+        cur_event++)
+   {
+       if (cur_event->reset)
+       {
+           WaitEventAdjustWin32(set, cur_event);
+           cur_event->reset = false;
+       }
+   }

There's no need to include returned_events < nevents in the loop condition here because returned_events isn't changing. I think I'd also guard the reset flag with #ifdef WIN32. If it's not properly supported elsewhere it's just a foot-gun, and there seems to be no reason to write the code to properly support it elsewhere, whatever that would mean.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Quorum commit for multiple synchronous replication.
On Tue, Dec 20, 2016 at 1:44 AM, Alvaro Herrera wrote:
> Fujii Masao wrote:
>
>> Regarding this feature, there are some loose ends. We should work on
>> and complete them until the release.
>
> Please list these in https://wiki.postgresql.org/wiki/Open_Items so that
> we don't forget.

Yep, added!

Regards,

--
Fujii Masao
Re: [HACKERS] Logical Replication WIP
On 20/12/16 10:56, Erik Rijkers wrote:
> On 2016-12-20 10:48, Petr Jelinek wrote:
>
> Here is another small thing:
>
> $ psql -d testdb -p 6972
> psql (10devel_logical_replication_20161220_1008_db80acfc9d50)
> Type "help" for help.
>
> testdb=# drop publication if exists xxx;
> ERROR:  unrecognized object type: 28
>
> testdb=# drop subscription if exists xxx;
> WARNING:  relcache reference leak: relation "pg_subscription" not closed
> DROP SUBSCRIPTION
>
> I don't mind but I suppose eventually other messages need to go there

Yep, attached should fix it. DDL for completely new db objects surely touches a lot of places.

--
Petr Jelinek
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

diff --git a/src/backend/commands/dropcmds.c b/src/backend/commands/dropcmds.c
index 61ff8f2..7080c4b 100644
--- a/src/backend/commands/dropcmds.c
+++ b/src/backend/commands/dropcmds.c
@@ -441,6 +441,10 @@ does_not_exist_skipping(ObjectType objtype, List *objname, List *objargs)
 				}
 			}
 			break;
+		case OBJECT_PUBLICATION:
+			msg = gettext_noop("publication \"%s\" does not exist, skipping");
+			name = NameListToString(objname);
+			break;
 		default:
 			elog(ERROR, "unrecognized object type: %d", (int) objtype);
 			break;
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 25c8c34..bd27aac 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -492,6 +492,8 @@ DropSubscription(DropSubscriptionStmt *stmt)
 
 	if (!HeapTupleIsValid(tup))
 	{
+		heap_close(rel, NoLock);
+
 		if (!stmt->missing_ok)
 			ereport(ERROR,
 					(errcode(ERRCODE_UNDEFINED_OBJECT),
@@ -501,6 +503,7 @@ DropSubscription(DropSubscriptionStmt *stmt)
 		ereport(NOTICE,
 				(errmsg("subscription \"%s\" does not exist, skipping",
 						stmt->subname)));
+		return;
 	}
Re: [HACKERS] Hash Indexes
On Tue, Dec 20, 2016 at 7:44 PM, Robert Haas wrote: > On Tue, Dec 20, 2016 at 9:01 AM, Amit Kapila wrote: >> On Tue, Dec 20, 2016 at 7:11 PM, Robert Haas wrote: >>> On Tue, Dec 20, 2016 at 4:51 AM, Amit Kapila >>> wrote: We have mainly four actions for squeeze operation, add tuples to the write page, empty overflow page, unlinks overflow page, make it free by setting the corresponding bit in overflow page. Now, if we don't log the changes to write page and freeing of overflow page as one operation, then won't query on standby can either see duplicate tuples or miss the tuples which are freed in overflow page. >>> >>> No, I think you could have two operations: >>> >>> 1. Move tuples from the "read" page to the "write" page. >>> >>> 2. Unlink the overflow page from the chain and mark it free. >>> >>> If we fail after step 1, the bucket chain might end with an empty >>> overflow page, but that's OK. >> >> If there is an empty page in bucket chain, access to that page will >> give an error (In WAL patch we are initializing the page instead of >> making it completely empty, so we might not see an error in such a >> case). > > It wouldn't be a new, uninitialized page. It would be empty of > tuples, not all-zeroes. > AFAIU we initialize page as all-zeros, but I think you are envisioning that we need to change it to a new uninitialized page. >> What advantage do you see by splitting the operation? > > It's simpler. The code here is very complicated and trying to merge > too many things into a single operation may make it even more > complicated, increasing the risk of bugs and making the code hard to > maintain without necessarily buying much performance. > Sure, if you find that way better, then we can change it, but the current patch treats it as a single operation. If after looking the patch you find it is better to change it into two operations, I will do so. -- With Regards, Amit Kapila. 
EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Declarative partitioning - another take
On Tue, Dec 20, 2016 at 4:51 AM, Alvaro Herrera wrote: > Amit Langote wrote: > >> diff --git a/src/backend/commands/tablecmds.c >> b/src/backend/commands/tablecmds.c >> index 1c219b03dd..6a179596ce 100644 >> --- a/src/backend/commands/tablecmds.c >> +++ b/src/backend/commands/tablecmds.c >> @@ -13297,8 +13297,10 @@ ATExecAttachPartition(List **wqueue, Relation rel, >> PartitionCmd *cmd) >> } >> } >> >> + /* It's safe to skip the validation scan after all */ >> if (skip_validate) >> - elog(NOTICE, "skipping scan to validate partition constraint"); >> + ereport(INFO, >> + (errmsg("skipping scan to validate partition >> constraint"))); > > Why not just remove the message altogether? That's certainly an option. It might be noise in some situations. On the other hand, it affects whether attaching the partition is O(1) or O(n), so somebody might well want to know. Or maybe they might be more likely to want a message in the reverse situation, telling them that the partition constraint DOES need to be validated. I'm not sure what the best user interface is here; thoughts welcome. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hash Indexes
On Tue, Dec 20, 2016 at 9:01 AM, Amit Kapila wrote: > On Tue, Dec 20, 2016 at 7:11 PM, Robert Haas wrote: >> On Tue, Dec 20, 2016 at 4:51 AM, Amit Kapila wrote: >>> We have mainly four actions for squeeze operation, add tuples to the >>> write page, empty overflow page, unlinks overflow page, make it free >>> by setting the corresponding bit in overflow page. Now, if we don't >>> log the changes to write page and freeing of overflow page as one >>> operation, then won't query on standby can either see duplicate tuples >>> or miss the tuples which are freed in overflow page. >> >> No, I think you could have two operations: >> >> 1. Move tuples from the "read" page to the "write" page. >> >> 2. Unlink the overflow page from the chain and mark it free. >> >> If we fail after step 1, the bucket chain might end with an empty >> overflow page, but that's OK. > > If there is an empty page in bucket chain, access to that page will > give an error (In WAL patch we are initializing the page instead of > making it completely empty, so we might not see an error in such a > case). It wouldn't be a new, uninitialized page. It would be empty of tuples, not all-zeroes. > What advantage do you see by splitting the operation? It's simpler. The code here is very complicated and trying to merge too many things into a single operation may make it even more complicated, increasing the risk of bugs and making the code hard to maintain without necessarily buying much performance. > Anyway, I think it is better to discuss this in WAL patch thread. OK. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Time to drop old-style (V0) functions?
On Tue, Dec 20, 2016 at 8:44 AM, Andres Freund wrote: >> Advanced Server uses 256, and we had a customer complain recently that >> 256 wasn't high enough. So apparently the use case for functions with >> ridiculous numbers of arguments isn't exactly 0. > > Well, there's a cost/benefit analysis involved here, as in a lot of > other places. There's a lot of things one can conceive a use-case for, > doesn't mean it's worth catering for all of them. Clearly true. >> I mean, there's no reason that we can't use dynamic allocation here. >> If we put the first, say, 8 arguments in the FunctionCallInfoData >> itself and then separately allocated space any extras, the structure >> could be a lot smaller without needing an arbitrary limit on the >> argument count. > > Except that that'll mean additional overhead during evaluation of > function arguments, an overhead that everyone will have to pay. Suddenly > you need two loops that fill in arguments, depending on their > argument-number (i.e. additional branches in a thight spot). And not > being able to pass the null bitmap via an register argument also costs > everyone. Well, I'm hoping there is a way to have a fast-path and a slow-path without slowing things down too much. You're proposing something like that anyway, because you're proposing to put the first two arguments in actual function arguments and the rest someplace else. I wouldn't be surprised if 99% of function calls had 2 arguments or less because most people probably call functions mostly or entirely as operators and even user-defined functions don't necessarily have lots of arguments. But we wouldn't propose restricting all function calls to 2 arguments just so +(int4, int4) can be fast. There's clearly got to be a slow path for calls with more than 2 arguments and, if that's the case, I don't see why there can't be a really-slow path, perhaps guarded by unlikely(), for cases involving more than 8 or 16 or whatever. 
I find it really hard to believe that in 2016 that we can't find a way to make simple things fast and complicated things possible. Saying we can't make SQL-callable functions with large number of arguments work feels to me like saying that we have to go back to 8-character filenames with 3-character extensions for efficiency - in the 1980s, that argument might have held some water, but nobody buys it any more. Human time is worth more than computer time, and humans save time when programs don't impose inconveniently-small limits. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
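The inline-plus-spill layout being discussed in this thread can be sketched like this (hypothetical struct and names, not the real FunctionCallInfoData; the common case of few arguments stays small while large argument counts spill to a separate allocation):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef uintptr_t Datum;		/* stand-in for PostgreSQL's Datum */

#define FCI_INLINE_ARGS 8

/*
 * Hypothetical layout: the first 8 arguments live inline in the struct,
 * the rest in a separately allocated spill array, so there is no hard
 * cap on argument count and no oversized fixed array.
 */
typedef struct
{
	int			nargs;
	Datum		inline_args[FCI_INLINE_ARGS];
	Datum	   *extra_args;		/* nargs - FCI_INLINE_ARGS entries, or NULL */
} FcInfoSketch;

static FcInfoSketch *
fci_create(int nargs)
{
	FcInfoSketch *fci = calloc(1, sizeof(FcInfoSketch));

	fci->nargs = nargs;
	if (nargs > FCI_INLINE_ARGS)
		fci->extra_args = calloc(nargs - FCI_INLINE_ARGS, sizeof(Datum));
	return fci;
}

static void
fci_set_arg(FcInfoSketch *fci, int i, Datum d)
{
	if (i < FCI_INLINE_ARGS)
		fci->inline_args[i] = d;
	else
		fci->extra_args[i - FCI_INLINE_ARGS] = d;
}

static Datum
fci_get_arg(const FcInfoSketch *fci, int i)
{
	return (i < FCI_INLINE_ARGS) ?
		fci->inline_args[i] : fci->extra_args[i - FCI_INLINE_ARGS];
}
```

The branch on the argument index is the "really-slow path" cost Robert alludes to; with the threshold chosen above the overwhelmingly common argument counts, an unlikely() hint could keep it off the hot path.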
[HACKERS] pgstattuple documentation clarification
Recently a client was confused because there was a substantial difference between the reported table_len of a table and the sum of the corresponding tuple_len, dead_tuple_len and free_space. The docs are fairly silent on this point, and I agree that in the absence of an explanation it is confusing, so I propose that we add a clarifying note along the lines of:

    The table_len will always be greater than the sum of the tuple_len,
    dead_tuple_len and free_space. The difference is accounted for by page
    overhead and space that is not free but cannot be attributed to any
    particular tuple.

Or perhaps we should be more explicit and refer to the item pointers on the page.

Thoughts?

cheers

andrew
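The arithmetic behind the proposed note, using representative heap-page numbers (8 kB pages, a 24-byte page header, 4-byte line pointers). This is an explanatory model, not pgstattuple code, and it gives only the floor of the gap — alignment padding widens it further:

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE			8192	/* default BLCKSZ */
#define PAGE_HEADER_SIZE	24		/* page header */
#define ITEM_POINTER_SIZE	4		/* one line pointer per tuple */

/*
 * Minimum difference between table_len and the sum of tuple_len,
 * dead_tuple_len and free_space for a table of npages pages holding
 * ntuples tuples: every page carries a header, and every tuple an
 * item pointer, none of which is counted in the per-tuple figures.
 */
static size_t
accounting_gap(size_t npages, size_t ntuples)
{
	return npages * PAGE_HEADER_SIZE + ntuples * ITEM_POINTER_SIZE;
}
```

So even a single page with 60 tuples is "missing" a couple of hundred bytes, which matches the kind of discrepancy the client noticed.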
Re: [HACKERS] Hash Indexes
On Tue, Dec 20, 2016 at 7:11 PM, Robert Haas wrote: > On Tue, Dec 20, 2016 at 4:51 AM, Amit Kapila wrote: >> We have mainly four actions for squeeze operation, add tuples to the >> write page, empty overflow page, unlinks overflow page, make it free >> by setting the corresponding bit in overflow page. Now, if we don't >> log the changes to write page and freeing of overflow page as one >> operation, then won't query on standby can either see duplicate tuples >> or miss the tuples which are freed in overflow page. > > No, I think you could have two operations: > > 1. Move tuples from the "read" page to the "write" page. > > 2. Unlink the overflow page from the chain and mark it free. > > If we fail after step 1, the bucket chain might end with an empty > overflow page, but that's OK. > If there is an empty page in bucket chain, access to that page will give an error (In WAL patch we are initializing the page instead of making it completely empty, so we might not see an error in such a case). What advantage do you see by splitting the operation? Anyway, I think it is better to discuss this in WAL patch thread. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Protect syscache from bloating with negative cache entries
On Mon, Dec 19, 2016 at 6:15 AM, Kyotaro HORIGUCHI wrote: > Hello, recently one of my customer stumbled over an immoderate > catcache bloat. This isn't only an issue for negative catcache entries. A long time ago, there was a limit on the size of the relcache, which was removed because if you have a workload where the working set of relations is just larger than the limit, performance is terrible. But the problem now is that backend memory usage can grow without bound, and that's also bad, especially on systems with hundreds of long-lived backends. In connection-pooling environments, the problem is worse, because every connection in the pool eventually caches references to everything of interest to any client. Your patches seem to me to have some merit, but I wonder if we should also consider having a time-based threshold of some kind. If, say, a backend hasn't accessed a catcache or relcache entry for many minutes, it becomes eligible to be flushed. We could implement this by having some process, like the background writer, SendProcSignal(PROCSIG_HOUSEKEEPING) to every process in the system every 10 minutes or so. When a process receives this signal, it sets a flag that is checked before going idle. When it sees the flag set, it makes a pass over every catcache and relcache entry. All the ones that are unmarked get marked, and all of the ones that are marked get removed. Access to an entry clears any mark. So anything that's not touched for more than 10 minutes starts dropping out of backend caches. Anyway, that would be a much bigger change from what you are proposing here, and what you are proposing here seems reasonable so I guess I shouldn't distract from it. Your email just made me think of it, because I agree that catcache/relcache bloat is a serious issue. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
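The mark-and-sweep aging Robert describes can be modeled in a few lines (hypothetical structures; real catcache/relcache entries are far more involved). Access clears the mark; each housekeeping pass evicts anything still marked from the previous pass and marks the survivors:

```c
#include <assert.h>
#include <stdbool.h>

#define NENTRIES 4

/* Hypothetical cache entry: only the aging state matters here. */
typedef struct
{
	bool		valid;
	bool		marked;
} CacheEntry;

static CacheEntry cache[NENTRIES];

/* An access "touches" the entry, clearing any housekeeping mark. */
static void
cache_access(int i)
{
	cache[i].marked = false;
}

/*
 * One housekeeping pass: entries still marked from the previous pass
 * have gone a full interval without access and are evicted; the rest
 * get marked for the next pass.
 */
static void
housekeeping_pass(void)
{
	for (int i = 0; i < NENTRIES; i++)
	{
		if (!cache[i].valid)
			continue;
		if (cache[i].marked)
			cache[i].valid = false;		/* idle for >= one full interval */
		else
			cache[i].marked = true;
	}
}
```

An entry therefore survives as long as it is touched at least once between consecutive passes, which is exactly the "not touched for more than 10 minutes starts dropping out" behavior described above.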
Re: [HACKERS] pg_sequence catalog
On 12/1/16 9:47 PM, Andreas Karlsson wrote:
> I think this patch looks good now so I am setting it to ready for
> committer.

committed, thanks

--
Peter Eisentraut
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Time to drop old-style (V0) functions?
On 2016-12-20 08:35:05 -0500, Robert Haas wrote: > On Tue, Dec 20, 2016 at 8:21 AM, Andres Freund wrote: > > On 2016-12-20 08:15:07 -0500, Robert Haas wrote: > >> On Tue, Dec 20, 2016 at 3:11 AM, Andres Freund wrote: > >> > I think a more efficient variant would make the function signature look > >> > something like: > >> > > >> > Datum /* directly returned argument */ > >> > pgfunc( > >> > /* extra information about function call */ > >> > FunctionCallInfo *fcinfo, > >> > /* bitmap of NULL arguments */ > >> > uint64_t nulls, > >> > /* first argument */ > >> > Datum arg0, > >> > /* second argument */ > >> > Datum arg1, > >> > /* returned NULL */ > >> > bool *isnull > >> > ); > >> > >> Yeah, that's kind of nice. I like the uint64 for nulls, although > >> FUNC_MAX_ARGS > 64 by default and certainly can be configured that > >> way. It wouldn't be a problem for any common cases, of course. > > > > Well, I suspect we'd have to change that then. Is anybody seriously > > interested in supporting FUNC_MAX_ARGS > 64? We don't have to make our > > life hard by supporting useless features... I'd even question using > > 64bit for this on 32bit platforms. > > Advanced Server uses 256, and we had a customer complain recently that > 256 wasn't high enough. So apparently the use case for functions with > ridiculous numbers of arguments isn't exactly 0. Well, there's a cost/benefit analysis involved here, as in a lot of other places. There's a lot of things one can conceive a use-case for, doesn't mean it's worth catering for all of them. > I mean, there's no reason that we can't use dynamic allocation here. > If we put the first, say, 8 arguments in the FunctionCallInfoData > itself and then separately allocated space any extras, the structure > could be a lot smaller without needing an arbitrary limit on the > argument count. Except that that'll mean additional overhead during evaluation of function arguments, an overhead that everyone will have to pay. 
Suddenly you need two loops that fill in arguments, depending on their argument number (i.e. additional branches in a tight spot). And not being able to pass the null bitmap via a register argument also costs everyone.

Andres
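For reference, the register-passed null bitmap being discussed in this thread reduces a NULL check to a shift and mask, with no memory access into a bool array. A sketch under the assumption of at most 64 arguments (invented helper names; the real ABI may differ):

```c
#include <assert.h>
#include <stdint.h>
#include <stdbool.h>

/*
 * With the null bitmap passed as a uint64 (ideally in a register), a NULL
 * check for argument argno is a single shift-and-mask.  This only works
 * if FUNC_MAX_ARGS is capped at 64, which is the trade-off debated above.
 */
static inline bool
arg_is_null(uint64_t nulls, int argno)
{
	return (nulls >> argno) & UINT64_C(1);
}

static inline uint64_t
set_arg_null(uint64_t nulls, int argno)
{
	return nulls | (UINT64_C(1) << argno);
}
```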
Re: [HACKERS] Hash Indexes
On Tue, Dec 20, 2016 at 4:51 AM, Amit Kapila wrote: > We have mainly four actions for squeeze operation, add tuples to the > write page, empty overflow page, unlinks overflow page, make it free > by setting the corresponding bit in overflow page. Now, if we don't > log the changes to write page and freeing of overflow page as one > operation, then won't query on standby can either see duplicate tuples > or miss the tuples which are freed in overflow page. No, I think you could have two operations: 1. Move tuples from the "read" page to the "write" page. 2. Unlink the overflow page from the chain and mark it free. If we fail after step 1, the bucket chain might end with an empty overflow page, but that's OK. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Make pg_basebackup -x stream the default
On Mon, Dec 19, 2016 at 7:51 PM, Vladimir Rusinov wrote: > > On Sat, Dec 17, 2016 at 2:37 PM, Magnus Hagander > wrote: >> >> Attached is an updated patch that does this. As a bonus it simplifies the >> code a bit. I also fixed an error message that I missed updating in the >> previous patch. > > > looks good to me. Yep, basically looks good to me. Here are some small comments. > while ((c = getopt_long(argc, argv, "D:F:r:RT:xX:l:nNzZ:d:c:h:p:U:s:S:wWvP", > long_options, &option_index)) != -1) 'x' should be removed from the above code. > To create a backup of a single-tablespace local database and compress this > with bzip2: > > $ pg_basebackup -D - -Ft | bzip2 > backup.tar.bz2 The above example in the docs needs to be modified. For example, add "-X fetch" into the above command so that pg_basebackup should not exit with an error. > The server must also be configured with max_wal_senders set high > enough to leave at least one session available for the backup. I think that it's better to explain explicitly here that max_wal_senders should be higher than one by default. Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: pg_authid.rolpassword format (was Re: [HACKERS] Password identifiers, protocol aging and SCRAM protocol)
Heikki, * Heikki Linnakangas (hlinn...@iki.fi) wrote: > Even if you have a separate "verifier type" column, it's not fully > normalized, because there's still a dependency between the verifier > and verifier type columns. You will always need to look at the > verifier type to make sense of the verifier itself. That's true- but you don't need to look at the verifier, or even have *access* to the verifier, to look at the verifier type. That is actually very useful when you start thinking about the downstream side of this- what about the monitoring tool which will want to check and make sure there are only certain verifier types being used? It'll have to be a superuser, or have access to some superuser security defined function, and that really sucks. I'm not saying that we would necessairly want the verifier type to be publicly visible, but being able to see it without being a superuser would be good, imv. > It's more convenient to carry the type information with the verifier > itself, in backend code, in pg_dump, etc. Sure, you could have a > separate "transfer" text format that has the prefix, and strip it > out when the datum enters the system. But it is even simpler to have > only one format, with the prefix, and use that everywhere. It's more convenient when you need to look at both- it's not more convenient when you only wish to look at the verifier type. Further, it means that we have to have a construct that assumes things about the verifier type and verifier- what if a verifier type came along that used a colon? We'd have to do some special magic to handle that correctly, and that just sucks, and anyone who is writing code to generically deal with these fields will end up writing that same code (or forgetting to, and not handling the case correctly). > It might make sense to add a separate column, to e.g. make it easier > to e.g. query for users that have an MD5 verifier. 
You could do > "WHERE rolverifiertype = 'md5'", instead of "WHERE rolpassword LIKE > 'md5%'". It's not a big difference, though. But even if we did that, > I would still love to have the type information *also* included with > the verifier itself, for convenience. And if we include it in the > verifier itself, adding a separate type column seems more trouble > than it's worth. I don't agree that it's "not a big difference." As I argue above- your approach also assumes that anyone who would like to investigate the verifier type should have access to the verifier itself, which I do not agree with. I also have a hard time buying the argument that it's really so much more convenient to have the verifier type included in the same string as the verifier that we should duplicate that information and then run the risk that we end up with the two not matching or that we won't ever run into complications down the road when our chosen separator causes us difficulties. Thanks! Stephen signature.asc Description: Digital signature
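The prefix-based type detection being debated might look like the following. This is a hypothetical sketch: the "md5" prefix (an md5 verifier is "md5" plus 32 hex digits) is today's convention, while the "scram-sha-256:" form is only assumed here for illustration — and its colon is exactly the separator hazard Stephen raises:

```c
#include <assert.h>
#include <string.h>

/*
 * Derive a verifier type from a self-identifying verifier string.
 * Assumed formats, for illustration only:
 *   md5<32 hex digits>            -> "md5"
 *   scram-sha-256:<salt>:<key>    -> "scram-sha-256"
 *   anything else                 -> "plain"
 */
static const char *
verifier_type(const char *verifier)
{
	if (strncmp(verifier, "md5", 3) == 0 && strlen(verifier) == 35)
		return "md5";
	if (strncmp(verifier, "scram-sha-256:", 14) == 0)
		return "scram-sha-256";
	return "plain";
}
```

Note that every consumer — backend, pg_dump, monitoring — must carry logic like this to see just the type, which is the crux of Stephen's argument for a separate, more widely readable column.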
Re: [HACKERS] Time to drop old-style (V0) functions?
On Tue, Dec 20, 2016 at 8:21 AM, Andres Freund wrote:
> On 2016-12-20 08:15:07 -0500, Robert Haas wrote:
>> On Tue, Dec 20, 2016 at 3:11 AM, Andres Freund wrote:
>> > I think a more efficient variant would make the function signature look
>> > something like:
>> >
>> > Datum /* directly returned argument */
>> > pgfunc(
>> >     /* extra information about function call */
>> >     FunctionCallInfo *fcinfo,
>> >     /* bitmap of NULL arguments */
>> >     uint64_t nulls,
>> >     /* first argument */
>> >     Datum arg0,
>> >     /* second argument */
>> >     Datum arg1,
>> >     /* returned NULL */
>> >     bool *isnull
>> > );
>>
>> Yeah, that's kind of nice. I like the uint64 for nulls, although
>> FUNC_MAX_ARGS > 64 by default and certainly can be configured that
>> way. It wouldn't be a problem for any common cases, of course.
>
> Well, I suspect we'd have to change that then. Is anybody seriously
> interested in supporting FUNC_MAX_ARGS > 64? We don't have to make our
> life hard by supporting useless features... I'd even question using
> 64bit for this on 32bit platforms.

Advanced Server uses 256, and we had a customer complain recently that 256 wasn't high enough. So apparently the use case for functions with ridiculous numbers of arguments isn't exactly 0. For most people it's not a big problem in practice, but actually I think it's pretty lame that we have a limit at all.

I mean, there's no reason that we can't use dynamic allocation here. If we put the first, say, 8 arguments in the FunctionCallInfoData itself and then separately allocated space for any extras, the structure could be a lot smaller without needing an arbitrary limit on the argument count.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Time to drop old-style (V0) functions?
On 2016-12-20 08:15:07 -0500, Robert Haas wrote: > On Tue, Dec 20, 2016 at 3:11 AM, Andres Freund wrote: > > I think a more efficient variant would make the function signature look > > something like: > > > > Datum /* directly returned argument */ > > pgfunc( > > /* extra information about function call */ > > FunctionCallInfo *fcinfo, > > /* bitmap of NULL arguments */ > > uint64_t nulls, > > /* first argument */ > > Datum arg0, > > /* second argument */ > > Datum arg1, > > /* returned NULL */ > > bool *isnull > > ); > > Yeah, that's kind of nice. I like the uint64 for nulls, although > FUNC_MAX_ARGS > 64 by default and certainly can be configured that > way. It wouldn't be a problem for any common cases, of course. Well, I suspect we'd have to change that then. Is anybody seriously interested in supporting FUNC_MAX_ARGS > 64? We don't have to make our life hard by supporting useless features... I'd even question using 64bit for this on 32bit platforms. Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] increasing the default WAL segment size
On 2016-12-20 08:10:29 -0500, Robert Haas wrote: > We could use the GUC assign hook to compute a mask and a shift, so > that this could be written as (CurrPos & mask_variable) == 0. That > would avoid the division instruction, though not the memory access. I suspect that'd be fine. > I hope this is all in the noise, though. Could very well be. > I know this is code is hot but I think it'll be hard to construct a > test case where the bottleneck is anything other than the speed at > which the disk can absorb bytes. I don't think that's really true. Heikki's WAL changes made a *BIG* difference. And pretty small changes in xlog.c can make noticeable throughput differences both in single and multi-threaded workloads. E.g. witnessed by the fact that the crc computation used to be a major bottleneck (and the crc32c instruction still shows up noticeably in profiles). SSDs have become fast enough that it's increasingly hard to saturate them. Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Time to drop old-style (V0) functions?
On Tue, Dec 20, 2016 at 3:11 AM, Andres Freund wrote: > I think a more efficient variant would make the function signature look > something like: > > Datum /* directly returned argument */ > pgfunc( > /* extra information about function call */ > FunctionCallInfo *fcinfo, > /* bitmap of NULL arguments */ > uint64_t nulls, > /* first argument */ > Datum arg0, > /* second argument */ > Datum arg1, > /* returned NULL */ > bool *isnull > ); Yeah, that's kind of nice. I like the uint64 for nulls, although FUNC_MAX_ARGS > 64 by default and certainly can be configured that way. It wouldn't be a problem for any common cases, of course. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] increasing the default WAL segment size
On Tue, Dec 20, 2016 at 3:28 AM, Andres Freund wrote:
> I do think this has the potential for negative performance
> implications. Consider code like
>
>     /* skip over the page header */
>     if (CurrPos % XLogSegSize == 0)
>     {
>         CurrPos += SizeOfXLogLongPHD;
>         currpos += SizeOfXLogLongPHD;
>     }
>     else
>
> right now that's doable in an efficient manner, because XLogSegSize is
> constant. If it's a variable and the compiler doesn't even know it's a
> power-of-two, it'll have to do a full "div" - and that's quite easily
> noticeable in a lot of cases.
>
> Now it could entirely be that the costs of this will be swamped by
> everything else, but I'd not want to rely on it.

We could use the GUC assign hook to compute a mask and a shift, so that this could be written as (CurrPos & mask_variable) == 0. That would avoid the division instruction, though not the memory access.

I hope this is all in the noise, though. I know this code is hot but I think it'll be hard to construct a test case where the bottleneck is anything other than the speed at which the disk can absorb bytes. I suppose we could set fsync=off and put the whole cluster on a RAMDISK to avoid those bottlenecks, but of course no real storage system behaves like that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
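The assign-hook idea can be sketched as precomputing a mask and shift once when the (power-of-two) segment size is set, so the hot paths never divide (invented variable names, not the actual GUC hook code):

```c
#include <assert.h>
#include <stdint.h>

static uint64_t wal_segsize_mask;
static int	wal_segsize_shift;

/*
 * Run once when the segment-size setting changes.  Valid only because
 * the segment size is constrained to a power of two, so segsize - 1 is
 * an all-ones mask of the low bits.
 */
static void
assign_wal_segment_size(uint64_t new_segsize)
{
	wal_segsize_mask = new_segsize - 1;
	wal_segsize_shift = 0;
	while (new_segsize > 1)
	{
		new_segsize >>= 1;
		wal_segsize_shift++;
	}
}

/* (CurrPos % XLogSegSize == 0) becomes a single AND */
static int
at_segment_boundary(uint64_t pos)
{
	return (pos & wal_segsize_mask) == 0;
}

/* (CurrPos / XLogSegSize) becomes a shift */
static uint64_t
segment_number(uint64_t pos)
{
	return pos >> wal_segsize_shift;
}
```

As Robert notes, this removes the div but still touches memory for the mask; whether the load itself matters is the open question.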
Re: [HACKERS] too low cost of Bitmap index scan
2016-12-20 13:55 GMT+01:00 Robert Haas:
> On Tue, Dec 20, 2016 at 2:13 AM, Pavel Stehule wrote:
> > 2016-12-19 23:28 GMT+01:00 Robert Haas:
> >> On Sat, Dec 17, 2016 at 3:30 AM, Pavel Stehule wrote:
> >> > -> Bitmap Heap Scan on "Zasilka" (cost=5097.39..5670.64 rows=1 width=12)
> >> > (actual time=62.253..62.400 rows=3 loops=231)
> >> ...
> >> > When I disable bitmap scan, then the query is 6x time faster
> >>
> >> >    -> Index Scan using "Zasilka_idx_Dopravce" on "Zasilka"
> >> > (cost=0.00..30489.04 rows=1 width=12) (actual time=15.651..17.187 rows=3 loops=231)
> >> > Index Cond: ("Dopravce" = "Dopravce_Ridic_1"."ID")
> >> >    Filter: (("StavDatum" > (now() - '10 days'::interval)) AND (("Stav" =
> >> > 34) OR ("Stav" = 29) OR ("Stav" = 180) OR ("Stav" = 213) OR ("Stav" = 46) OR
> >> > (("Produkt" = 33) AND ("Stav" = 179)) OR ((("ZpetnaZasilka" = '-1'::integer)
> >> > OR ("PrimaZasilka" = '-1'::integer)) AND ("Stav" = 40
> >> > Rows Removed by Filter: 7596
> >>
> >> I'm not sure, but my guess would be that the query planner isn't
> >> getting a very accurate selectivity estimate for that giant filter
> >> condition, and that's why the cost estimate is off.
> >
> > maybe operator cost is too high?
>
> Hmm, seems like you'd be paying the operator cost either way.
>

It looks like this cost is much more significant in the index-scan case.

Pavel

> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
Re: pg_authid.rolpassword format (was Re: [HACKERS] Password identifiers, protocol aging and SCRAM protocol)
On Tue, Dec 20, 2016 at 6:37 AM, Heikki Linnakangas wrote:
> It's more convenient to carry the type information with the verifier
> itself, in backend code, in pg_dump, etc. Sure, you could have a separate
> "transfer" text format that has the prefix, and strip it out when the
> datum enters the system. But it is even simpler to have only one format,
> with the prefix, and use that everywhere.

I see your point.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] too low cost of Bitmap index scan
On Tue, Dec 20, 2016 at 2:13 AM, Pavel Stehule wrote:
> 2016-12-19 23:28 GMT+01:00 Robert Haas :
>> On Sat, Dec 17, 2016 at 3:30 AM, Pavel Stehule wrote:
>>> -> Bitmap Heap Scan on "Zasilka" (cost=5097.39..5670.64 rows=1 width=12)
>>>    (actual time=62.253..62.400 rows=3 loops=231)
>> ...
>>> When I disable bitmap scan, then the query is 6x faster
>>>
>>> -> Index Scan using "Zasilka_idx_Dopravce" on "Zasilka"
>>>    (cost=0.00..30489.04 rows=1 width=12) (actual time=15.651..17.187
>>>    rows=3 loops=231)
>>>    Index Cond: ("Dopravce" = "Dopravce_Ridic_1"."ID")
>>>    Filter: (("StavDatum" > (now() - '10 days'::interval)) AND (("Stav" =
>>>    34) OR ("Stav" = 29) OR ("Stav" = 180) OR ("Stav" = 213) OR ("Stav" =
>>>    46) OR (("Produkt" = 33) AND ("Stav" = 179)) OR ((("ZpetnaZasilka" =
>>>    '-1'::integer) OR ("PrimaZasilka" = '-1'::integer)) AND ("Stav" = 40
>>>    Rows Removed by Filter: 7596
>>
>> I'm not sure, but my guess would be that the query planner isn't
>> getting a very accurate selectivity estimate for that giant filter
>> condition, and that's why the cost estimate is off.
>
> maybe operator cost is too high?

Hmm, seems like you'd be paying the operator cost either way. No?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: pg_authid.rolpassword format (was Re: [HACKERS] Password identifiers, protocol aging and SCRAM protocol)
On 12/16/2016 03:31 AM, Michael Paquier wrote:
> On Thu, Dec 15, 2016 at 9:48 PM, Heikki Linnakangas wrote:
>> The only way to distinguish is to know about every verifier kind there
>> is, and check whether rolpassword looks valid as anything other than a
>> plaintext password. And we already got tripped by a bug-of-omission on
>> that once. If we add more verifier formats in the future, it's bound to
>> happen again. Let's nip that source of bugs in the bud. Attached is a
>> patch to implement what I have in mind.
>
> OK, I had a look at the patch proposed.
>
> -if (!pg_md5_encrypt(username, username, namelen, encrypted))
> -elog(ERROR, "password encryption failed");
> -if (strcmp(password, encrypted) == 0)
> -ereport(ERROR,
> -(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> - errmsg("password must not contain user name")));
>
> This patch removes the only check for MD5 hashes that passwordcheck has
> ever done. It may be fine to remove it, but I would think that it is a
> good source of examples regarding what could be done with MD5 hashes,
> though limited. So it seems to me that this check should also involve
> pg_md5_encrypt on the username and compare it with the MD5 hash given by
> the caller.

Actually, it does still perform that check. There's a new function, plain_crypt_verify, that passwordcheck uses now. plain_crypt_verify() is intended to work with any hash formats we might introduce in the future (including SCRAM), so that passwordcheck doesn't need to know about all the hash formats.

> A simple ALTER USER role PASSWORD 'foo' causes a crash:

Ah, fixed.

> +case PASSWORD_TYPE_PLAINTEXT:
> +shadow_pass = &shadow_pass[strlen("plain:")];
> +break;
>
> It would be a good idea to have a generic routine able to get the plain
> password value. In short I think that we should reduce the number of
> locations where the "plain:" prefix is hardcoded.

There is such a function included in the patch, get_plain_password(char *shadow_pass), actually. Contrib/passwordcheck uses it. I figured that in crypt.c itself, it's OK to do the above directly, but get_plain_password() is intended to be used elsewhere.

Thanks for having a look! Attached is a new version, with that bug fixed.

- Heikki

0001-Use-plain-prefix-for-plaintext-passwords-stored-in-p-2.patch
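A minimal sketch of what such a prefix-stripping helper could look like. This is only an illustration of the idea of centralizing the "plain:" prefix handling; the actual get_plain_password() in the patch may differ (for instance, it may report an error rather than return NULL):

```c
#include <stddef.h>
#include <string.h>

#define PLAIN_PREFIX "plain:"

/* Sketch: return a pointer past the "plain:" prefix, or NULL if the
 * stored string is not a plaintext password.  Centralizing this in one
 * helper avoids hardcoding the prefix at every call site. */
static const char *
get_plain_password_sketch(const char *shadow_pass)
{
    size_t prefix_len = strlen(PLAIN_PREFIX);

    if (strncmp(shadow_pass, PLAIN_PREFIX, prefix_len) == 0)
        return shadow_pass + prefix_len;
    return NULL;
}
```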
Re: [HACKERS] invalid combination of options "-D - -F t -X stream" in pg_basebackup
On Tue, Dec 20, 2016 at 6:56 AM, Fujii Masao wrote:
> On Tue, Dec 20, 2016 at 1:43 AM, Magnus Hagander wrote:
>> On Mon, Dec 19, 2016 at 5:39 PM, Fujii Masao wrote:
>>> Hi,
>>>
>>> Isn't it better to forbid the combination of the options "-D -", "-F t"
>>> and "-X stream" in pg_basebackup? This is obviously an invalid setting,
>>> and the docs warn about it as follows. But currently users can specify
>>> such a setting and pg_basebackup can exit unexpectedly with an error.
>>>
>>> ---
>>> If the value - (dash) is specified as target directory, the tar contents
>>> will be written to standard output, suitable for piping to for example
>>> gzip. This is only possible if the cluster has no additional tablespaces.
>>> ---
>>
>> Yes, definitely. I'd say that's an oversight in implementing the support
>> for stream-to-tar that it did not detect this issue.
>>
>> Do you want to provide a patch for it, or should I?
>
> What about the attached patch?
>
> +fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
> +progname);
>
> I added the above hint message because other code checking invalid
> options also has such hint messages. But there is no additional
> useful information about valid combinations of options in the help
> message, so I'm a bit tempted to remove the above hint message.

Looks good to me.

-- 
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
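The up-front check being discussed could look roughly like the sketch below. The function name, enum, and variables are illustrative stand-ins, not pg_basebackup's actual option state:

```c
#include <stdbool.h>
#include <string.h>

typedef enum { NO_WAL, FETCH_WAL, STREAM_WAL } IncludeWal;

/* Sketch: tar-format output to stdout ("-D -") cannot also stream WAL,
 * because streamed WAL is written as a second tar file alongside the
 * base backup, and there is only one stdout to write to. */
static bool
backup_options_valid(const char *basedir, char format, IncludeWal wal)
{
    if (format == 't' && strcmp(basedir, "-") == 0 && wal == STREAM_WAL)
        return false;
    return true;
}
```

Rejecting the combination before any connection is made gives the user a clear error instead of an unexpected failure mid-backup.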
Re: pg_authid.rolpassword format (was Re: [HACKERS] Password identifiers, protocol aging and SCRAM protocol)
On 12/16/2016 05:48 PM, Robert Haas wrote:
> On Thu, Dec 15, 2016 at 8:40 AM, Stephen Frost wrote:
>> * Heikki Linnakangas (hlinn...@iki.fi) wrote:
>>> On 12/14/2016 04:57 PM, Stephen Frost wrote:
>>>> * Peter Eisentraut (peter.eisentr...@2ndquadrant.com) wrote:
>>>>> On 12/14/16 5:15 AM, Michael Paquier wrote:
>>>>>> I would be tempted to suggest adding the verifier type as a new
>>>>>> column of pg_authid
>>>>>
>>>>> Yes please. This discussion seems to continue to come up and I don't
>>>>> entirely understand why we keep trying to shove more things into
>>>>> pg_authid, or worse, into rolpassword.
>>>
>>> I understand the relational beauty of having a separate column for the
>>> verifier type, but I don't think it would be practical.
>>
>> I disagree.
>
> Me, too. I think the idea of moving everything into a separate table
> that allows multiple verifiers is probably not a good thing to do just
> right now, because that introduces a bunch of additional issues above
> and beyond what we need to do to get SCRAM implemented. There are
> administration and policy decisions to be made there that we should
> not conflate with SCRAM proper. However, Heikki's proposal seems to be
> that it's reasonable to force rolpassword to be of the form
> 'type:verifier' in all cases but not reasonable to have separate
> columns for type and verifier. Eh?

I fear we'll just have to agree to disagree here, but I'll try to explain myself one more time. Even if you have a separate "verifier type" column, it's not fully normalized, because there's still a dependency between the verifier and verifier type columns. You will always need to look at the verifier type to make sense of the verifier itself. It's more convenient to carry the type information with the verifier itself, in backend code, in pg_dump, etc. Sure, you could have a separate "transfer" text format that has the prefix, and strip it out when the datum enters the system. But it is even simpler to have only one format, with the prefix, and use that everywhere.

It might make sense to add a separate column, to e.g. make it easier to query for users that have an MD5 verifier. You could do "WHERE rolverifiertype = 'md5'", instead of "WHERE rolpassword LIKE 'md5%'". It's not a big difference, though. But even if we did that, I would still love to have the type information *also* included with the verifier itself, for convenience. And if we include it in the verifier itself, adding a separate type column seems more trouble than it's worth.

For comparison, imagine that we added a column to pg_authid for a picture of the user, stored as a bytea. The picture can be in JPEG or PNG format. Looking at the first few bytes of the image, you can tell which one it is. Would it make sense to add a separate "type" column, to tell what format the image is in? I think it would be more convenient and robust to rely on the first bytes of the image data instead.

- Heikki
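The "carry the type with the verifier" argument amounts to a dispatch like the following. The names here are hypothetical illustrations, not the backend's actual code; the point is only that a prefixed string is self-describing:

```c
#include <string.h>

typedef enum
{
    PASSWORD_TYPE_PLAINTEXT,
    PASSWORD_TYPE_MD5,
    PASSWORD_TYPE_UNKNOWN
} PasswordType;

/* Sketch: because the stored string carries its own type prefix, any
 * code holding a rolpassword value can interpret it without consulting
 * a separate type column. */
static PasswordType
classify_verifier(const char *rolpassword)
{
    if (strncmp(rolpassword, "plain:", 6) == 0)
        return PASSWORD_TYPE_PLAINTEXT;
    if (strncmp(rolpassword, "md5", 3) == 0)
        return PASSWORD_TYPE_MD5;
    return PASSWORD_TYPE_UNKNOWN;
}
```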
Re: [HACKERS] BUG: pg_stat_statements query normalization issues with combined queries
Hello again pgdevs,

Further investigation concludes that the information is lost by the parser. From a ;-separated string, raw_parser returns a List whose elements are directly the nodes of the statements, with empty statements silently skipped. The Node entry of high-level statements (*Stmt) does NOT contain any location information; only some lower-level nodes have it.

A possible fix would be to add the query string location information systematically at the head of every high-level *Stmt, which would then share both the NodeTag and this information (say, location and length). This would be a new intermediate kind of node from which all these statements would "inherit", say a "ParseNode" structure. The actual location information could be filled in at quite a high level in the parser, maybe around the "stmtmulti" and "stmt" rules, so it would not necessarily be too intrusive in the overall parser grammar.

Once the information is available, the query string can be cut where appropriate so that only the relevant substring is stored in pg_stat_statements or above.

Would this approach be acceptable, or is modifying Nodes a no-go area? If it is acceptable, I can probably put together a patch and submit it. If not, I suggest updating the documentation to note that pg_stat_statements does not work properly with combined queries.

-- 
Fabien.
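The proposed common header could be sketched as follows. The struct follows the proposal's naming ("ParseNode", location, length), but the layout and the helper are illustrative, not an actual patch:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef int NodeTag;            /* stand-in for PostgreSQL's NodeTag */

/* Proposed shared header: every high-level *Stmt node would begin with
 * the tag plus the statement's span in the original query string, so
 * any *Stmt pointer can be cast to ParseNode * to read the span. */
typedef struct ParseNode
{
    NodeTag     tag;
    int         location;       /* byte offset of the statement */
    int         length;         /* number of bytes it covers */
} ParseNode;

/* With the span available, pg_stat_statements could cut the combined
 * query string down to the single statement being tracked. */
static char *
statement_text(const char *query_string, const ParseNode *stmt)
{
    char *result = malloc(stmt->length + 1);

    assert(result != NULL);
    memcpy(result, query_string + stmt->location, stmt->length);
    result[stmt->length] = '\0';
    return result;
}
```

For a combined query like "SELECT 1; SELECT 2;", the second statement's node would carry location 10 and length 8, and statement_text() would yield just that statement's text.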