RE: How to make partitioning scale better for larger numbers of partitions
From: David Rowley [mailto:david.row...@2ndquadrant.com] > > David has submitted multiple patches for PG 12, one of which speeds up > pruning of UPDATE/DELETE (I couldn't find it in the current CF, though.) > What challenges are there for future versions, and which of them are being > addressed by patches in progress for PG 12, and which issues are untouched? > > I've not submitted that for PG12 yet. I had other ideas about just > getting rid of the inheritance planner altogether, but so far don't > have a patch for that. Still uncertain if there are any huge blockers > to that either. Sorry, I seem to have misunderstood something. By the way, what do you think is the "ideal and should-be-feasible" goal and the "realistic" goal we can reach in the near future (e.g. PG 12)? Say,
* Planning and execution time is O(log n), where n is the number of partitions
* Planning time is O(log n), execution time is O(1)
* Planning and execution time is O(1), where n is the number of partitions
Regards Takayuki Tsunakawa
Re: How to make partitioning scale better for larger numbers of partitions
On 2018/07/13 14:49, Tsunakawa, Takayuki wrote: > From: Amit Langote [mailto:langote_amit...@lab.ntt.co.jp] >> For SELECT/UPDATE/DELETE, overhead of partitioning in the planning phase >> is pretty significant and gets worse as the number of partitions grows. >> I >> had intended to fix that in PG 11, but we could only manage to get part >> of >> that work into PG 11, with significant help from David and others. So, >> while PG 11's overhead of partitioning during planning is less than PG >> 10's due to improved pruning algorithm, it's still pretty far from ideal, >> because it isn't just the pruning algorithm that had overheads. In fact, >> PG 11 only removes the pruning overhead for SELECT, so UPDATE/DELETE still >> carry the overhead that was in PG 10. > > David has submitted multiple patches for PG 12, one of which speeds up > pruning of UPDATE/DELETE (I couldn't find it in the current CF, though.) I don't think he has posted a new patch for improving the speed for UPDATE/DELETE planning yet, although I remember he had posted a PoC patch back in February or March. > What challenges are there for future versions, and which of them are being > addressed by patches in progress for PG 12, and which issues are untouched? The immediate one I think is to refactor the planner such that the new pruning code, that we were able to utilize for SELECT in PG 11, can also be used for UPDATE/DELETE. Refactoring needed to replace the pruning algorithm was minimal for SELECT, but not so much for UPDATE/DELETE. Once we're able to utilize the new pruning algorithm for all the cases, we can move on to refactoring to avoid locking and opening of all partitions. As long as we're relying on constraint exclusion for partition pruning, which we still do for UPDATE/DELETE, we cannot do that because constraint exclusion has to look at each partition individually. The fact that UPDATE/DELETE planning for partitioning uses huge amounts of memory and CPU is a pretty big issue, and refactoring the planner to avoid that may be the hardest of all the problems to be solved here. >> The overheads I mention stem from >> the fact that for partitioning we still rely on the old planner code >> that's used to perform inheritance planning, which requires to lock and >> open *all* partitions. We have so far been able to refactor just enough >> to use the new code for partition pruning, but there is much refactoring >> work left to avoid needlessly locking and opening all partitions. > > I remember the issue of opening and locking all partitions was discussed > before. Quite a few times I suppose. > Does this relate only to the planning phase? If the benchmark contains queries that will need to access just one partition, then yes, the planning part is the biggest overhead. Execution-time overhead is limited to having an extra, possibly needless, Append node, but I know David has a patch for that too. Thanks, Amit
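As a minimal sketch of the asymmetry described above (table and partition names are invented here, not the benchmark schema): in PG 11 the SELECT below can use the new pruning code, while the UPDATE is still planned through the old inheritance path, which locks and opens every partition during planning even if the finished plan touches only one.

    CREATE TABLE measurement (logdate date, value int) PARTITION BY RANGE (logdate);
    CREATE TABLE measurement_y2018m01 PARTITION OF measurement
        FOR VALUES FROM ('2018-01-01') TO ('2018-02-01');
    CREATE TABLE measurement_y2018m02 PARTITION OF measurement
        FOR VALUES FROM ('2018-02-01') TO ('2018-03-01');

    -- Plan-time pruning: only measurement_y2018m01 shows up in the plan.
    EXPLAIN SELECT * FROM measurement WHERE logdate = '2018-01-15';

    -- Planned via the inheritance planner with constraint exclusion; all
    -- partitions are locked and opened during planning, even though only
    -- one of them can match.
    EXPLAIN UPDATE measurement SET value = value + 1 WHERE logdate = '2018-01-15';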
Re: How to make partitioning scale better for larger numbers of partitions
On 13 July 2018 at 14:58, Kato, Sho wrote: > Of course I'm sure table partitioning work well with up to a hundred > partitions as written on the postgresql document. > > But, my customer will use partitioned table with 1.1k leaf partitions. > > So, we need to improve performance. > > Any ideas? It would be really good if you could review and test my partitioning patches in the current commitfest. These are the ones authored by me with the word "partition" in the title. These 4 patches don't solve all the problems, but they do make some good gains in some areas. I've still more work to do, so the earlier I can be finished with the 4 that are there now, the more I can focus on the rest. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: How to make partitioning scale better for larger numbers of partitions
On 13 July 2018 at 17:49, Tsunakawa, Takayuki wrote: > David has submitted multiple patches for PG 12, one of which speeds up > pruning of UPDATE/DELETE (I couldn't find it in the current CF, though.) > What challenges are there for future versions, and which of them are being > addressed by patches in progress for PG 12, and which issues are untouched? I've not submitted that for PG12 yet. I had other ideas about just getting rid of the inheritance planner altogether, but so far don't have a patch for that. Still uncertain if there are any huge blockers to that either. Join search needs to be performed multiple times, but a good advantage will be that we can take advantage of partition pruning to only join search the non-pruned partitions. The reason I changed my mind about the patch you're talking about is that it meant that RangeTblEntries needed to keep the same relation id in the inheritance planner as they did in the grouping planner, and another patch I have semi-baked delays building both RelOptInfo and RangeTblEntry records for partitions until after pruning. Without the RangeTblEntry it was difficult to ensure the relids were in lock-step between the two planners. There are ways to do it, it just didn't look pretty. Hoping to have something later in the cycle. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: How to make partitioning scale better for larger numbers of partitions
On Fri, Jul 13, 2018 at 05:49:20AM +, Tsunakawa, Takayuki wrote: > David has submitted multiple patches for PG 12, one of which speeds up > pruning of UPDATE/DELETE (I couldn't find it in the current CF, though.) > What challenges are there for future versions, and which of them are being > addressed by patches in progress for PG 12, and which issues are untouched? Here's a known issue which I'm not sure is on anybody's list: https://www.postgresql.org/message-id/flat/4F989CD8.8020804%40strategicdata.com.au#4f989cd8.8020...@strategicdata.com.au => planning of UPDATE/DELETE (but not SELECT) takes a huge amount of RAM when run on a parent with many children. We don't typically have UPDATEs, and those we do are against the child tables; but we ran into this last month with a manual query on the parent causing OOM. I worked around it, but keep meaning to revisit to see if this changed at all in v11 (very brief testing suggests it has not). Justin
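A rough sketch of the shape of that problem, using invented names and old-style inheritance children as in the linked report; planning the UPDATE on the parent builds planner structures for every child, which is where the memory goes, while the equivalent SELECT is comparatively cheap.

    CREATE TABLE parent (id int, val int);
    DO $$
    BEGIN
      FOR i IN 0..999 LOOP
        EXECUTE format(
          'CREATE TABLE child_%s (CHECK (id >= %s AND id < %s)) INHERITS (parent)',
          i, i * 1000, (i + 1) * 1000);
      END LOOP;
    END $$;

    -- Watch the backend's memory use while this is planned:
    EXPLAIN UPDATE parent SET val = val + 1 WHERE id = 42;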
Re: Cannot dump foreign key constraints on partitioned table
Thanks for the prompt fix, patch [1] works for me. [1] https://postgr.es/m/20180712184537.5vjwgxlbuiomomqd@alvherre.pgsql Regards, Amul
RE: How to make partitioning scale better for larger numbers of partitions
From: Amit Langote [mailto:langote_amit...@lab.ntt.co.jp] > For SELECT/UPDATE/DELETE, overhead of partitioning in the planning phase > is pretty significant and gets worse as the number of partitions grows. > I > had intended to fix that in PG 11, but we could only manage to get part > of > that work into PG 11, with significant help from David and others. So, > while PG 11's overhead of partitioning during planning is less than PG > 10's due to improved pruning algorithm, it's still pretty far from ideal, > because it isn't just the pruning algorithm that had overheads. In fact, > PG 11 only removes the pruning overhead for SELECT, so UPDATE/DELETE still > carry the overhead that was in PG 10. David has submitted multiple patches for PG 12, one of which speeds up pruning of UPDATE/DELETE (I couldn't find it in the current CF, though.) What challenges are there for future versions, and which of them are being addressed by patches in progress for PG 12, and which issues are untouched? > The overheads I mention stem from > the fact that for partitioning we still rely on the old planner code > that's used to perform inheritance planning, which requires to lock and > open *all* partitions. We have so far been able to refactor just enough > to use the new code for partition pruning, but there is much refactoring > work left to avoid needlessly locking and opening all partitions. I remember the issue of opening and locking all partitions was discussed before. Does this relate only to the planning phase? Kato-san, could you try pgbench -M prepared? Regards Takayuki Tsunakawa
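For reference, -M prepared makes pgbench use prepared statements over the extended protocol; at the SQL level it is roughly equivalent to the sketch below (statement name invented, table from the benchmark), so the parse step is done once per connection and a cached plan may be reused. Whether that actually helps here depends on how well the generic plan copes with the partitioned table.

    PREPARE upd(int, int) AS
      UPDATE test.accounts SET abalance = abalance + $2 WHERE aid = $1;
    EXECUTE upd(42, 100);
    EXECUTE upd(43, -250);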
Re: [PATCH] pg_hba.conf : new auth option : clientcert=verify-full
On Sat, Apr 14, 2018 at 3:48 AM, Julian Markwort wrote: > [a patch] Hello Julian, Could you please post a rebased patch? I haven't reviewed or tested any code yet, but here's some proof-reading: + This behaviour is similar to the cert autentication method "behavior" (our manual is written in en_US, "cd doc/src/sgml ; git grep behavior | wc -l" -> 895, "git grep behaviour" -> 0). cert "authentication" + chain, but it will also check whether the username or it's + mapping match the common name (CN) of the provided certificate. "its" "matches" + Note that certificate chain validation is always ensured when + cert authentication method is used extra space when *the* ... + In this case, the CN (nommon name) provided in "common name" + CN (Common Name) in the certificate matches "common"? (why a capital letter here?) This line isn't modified by your patch, but I saw it while in proof-reading mode: *err_msg = "clientcert can not be set to 0 when using \"cert\" authentication"; I think "can not" is usually written "cannot"? > slightly offtopic opinion: > While creating the test cases, I stumbled upon the problem of missing > depencies to run the tests... > It's complicated enough that the binaries used by these perl tests are not > named similar to the packages which provide them (the 'prove' binary is > supplied by 'Test-Harness'), so maybe in the interest of providing a lower > entry-barrier to running these tests, we could give a more detailed error > message in the configure script, when using --enable-tap-tests ? Yeah. The packages to install depend on your operating system, and in some cases (macOS, Windows?) which bolt-on package thingamajig you use, though. Perhaps the READMEs could be improved with details for systems we have reports about (like the recently added "Requirements" section of src/test/ldap/README). -- Thomas Munro http://www.enterprisedb.com
Re: file cloning in pg_upgrade and CREATE DATABASE
On Wed, Feb 21, 2018 at 4:00 PM, Peter Eisentraut wrote: > - XFS has (optional) reflink support. This file system is probably more > widely used than Btrfs. > > - Linux and glibc have a proper function to do this now. > > - APFS on macOS supports file cloning. TIL that Solaris 11.4 (closed) ZFS supports reflink() too. Sadly, it's not in OpenZFS though I see numerous requests and discussions... (Of course you can just clone the whole filesystem and then pg_upgrade the clone in-place). -- Thomas Munro http://www.enterprisedb.com
Re: Cannot dump foreign key constraints on partitioned table
On Thu, Jul 12, 2018 at 11:34:43PM -0400, Alvaro Herrera wrote: > I'm not sure what to *do* with the partition, though :-) I don't think > there's a nice way to verify that the FK actually exists, or that > catalog rows are set in such-and-such way, after restoring this. > The pg_restore tests are marked TODO in the suite. I think that'll have > to wait. Ouch, right. My eyes are betraying me. I swear when I tested that I saw an ALTER TABLE ADD CONSTRAINT command generated as well for each partition on top of the parent :) But only the parent needs to have the definition, so your test is sufficient. Sorry for the noise. -- Michael signature.asc Description: PGP signature
Re: Checkpoint not retrying failed fsync?
On Tue, Jun 12, 2018 at 3:31 PM, Thomas Munro wrote: > I was about to mark this patch "rejected" and forget about it, since > Craig's patch makes it redundant. But then I noticed that Craig's > patch doesn't actually remove the retry behaviour completely: it > promotes only EIO and ENOSPC to PANIC. Rejected, since this will have to be dealt with one way or another in that other thread. -- Thomas Munro http://www.enterprisedb.com
Re: [PATCH] Timestamp for a XLOG_BACKUP_END WAL-record
On Fri, Jul 13, 2018 at 08:13:39AM +0500, Andrey V. Lepikhov wrote: > Timestamp in a backup history file not correspond to any WAL record and > can't be bind with a time of backup exactly. > In my opinion, keeping timestamp in XLOG_BACKUP_END is more reliable, safe > and easy way for recovering a database to a specific time. Like Andres and Fujii-san, I don't really see the point of complicating the code for that. If your goal is to stop WAL replay once consistency has been reached, then just use recovery_target = 'immediate' if you fetch the data from a WAL archive and there are still segments after the consistency point. Or just use a self-contained backup which has all the WAL it needs without restore_command. If your goal is to make sure that the timestamp set in recovery.conf is not older than the moment the backup has ended, then the backup history file has what you are looking for. In short, in any case there is no point in duplicating data which already exists. You can as well use recovery_target_lsn to point exactly at the time a backup has ended as returned by pg_stop_backup, and this even saves doing math with timestamps. -- Michael signature.asc Description: PGP signature
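To make that last point concrete, a sketch assuming the non-exclusive backup API: pg_stop_backup() already hands back the end-of-backup LSN, which can be used directly as recovery_target_lsn when restoring.

    SELECT pg_start_backup('nightly', false, false);
    -- ... copy the data directory ...
    SELECT lsn FROM pg_stop_backup(false);
    -- The returned lsn is the point at which the backup ended; set
    -- recovery_target_lsn to that value instead of working out a timestamp.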
Re: How to make partitioning scale better for larger numbers of partitions
Kato-san, On 2018/07/13 11:58, Kato, Sho wrote: > Hi, > > I benchmarked on a RANGE partitioned table with 1.1k leaf partitions and no > sub-partitioned tables. Thanks for sharing the results. > But, statement latencies on a partitioned table is much slower than on a > non-partitioned table. > > UPDATE latency is 210 times slower than a non-partitioned table. > SELECT latency is 36 times slower than a non-partitioned table. > Surprisingly INSERT latency is almost same. Yes, INSERT comes out ahead because there is no overhead of partitioning in the planning phase. As David Rowley reported on the nearby thread ("Speeding up INSERTs and UPDATEs to partitioned tables"), there is still significant overhead during its execution, so we're still a fair bit away from the best possible performance. For SELECT/UPDATE/DELETE, overhead of partitioning in the planning phase is pretty significant and gets worse as the number of partitions grows. I had intended to fix that in PG 11, but we could only manage to get part of that work into PG 11, with significant help from David and others. So, while PG 11's overhead of partitioning during planning is less than PG 10's due to improved pruning algorithm, it's still pretty far from ideal, because it isn't just the pruning algorithm that had overheads. In fact, PG 11 only removes the pruning overhead for SELECT, so UPDATE/DELETE still carry the overhead that was in PG 10. The overheads I mention stem from the fact that for partitioning we still rely on the old planner code that's used to perform inheritance planning, which requires to lock and open *all* partitions. We have so far been able to refactor just enough to use the new code for partition pruning, but there is much refactoring work left to avoid needlessly locking and opening all partitions. Thanks, Amit
Re: Temporary tables prevent autovacuum, leading to XID wraparound
On Tue, Mar 13, 2018 at 08:08:48AM +, Tsunakawa, Takayuki wrote: > From: Tom Lane [mailto:t...@sss.pgh.pa.us] >> On the whole, my vote is to fix and apply step 2, and leave it at that. Yeah, I have been thinking about the idea 1 mentioned above, or in short, clean up the temporary namespace at connection start instead of at first use of it, and while that would make the cleanup more aggressive, it could be possible as well that only having autovacuum do the work could be enough, so I am +-0 on this idea. > Done. It seems to work well. I have looked at the patch proposed.

+ /* Does the backend own the temp schema? */
+ if (proc->tempNamespaceId != namespaceID)
+ return false;

I have a very hard time believing that this is safe lock-less, and it seems a spin lock would be enough.

+ /* Is the backend connected to this database? */
+ if (proc->databaseId != MyDatabaseId)
+ return false;

Wouldn't it be more interesting to do this cleanup as well if the backend is *not* connected to the database being autovacuum'ed? This would make the cleanup more aggressive, which is better. -- Michael signature.asc Description: PGP signature
Re: Cache invalidation after authentication (on-the-fly role creation)
On Fri, Jul 13, 2018 at 6:52 AM, Tom Lane wrote: > Thomas Munro writes: >> I suppose the call to AcceptInvalidationMessages() could go at the end >> of ClientAuthentication(). That'd be closer to the code that creates >> the negative entry and immediately after the code that might modify >> the catalogs. Or in PeformAuthentication(), its caller. Or in >> IntializeSessionUserId() immediately before we try to look up the >> name, but that's also called by background workers that don't need it. > > I think my preference would be to put it in InitializeSessionUserId > so that it's clearly associated with the syscache lookups we're trying > to protect. Thanks. Pushed to master that way. With this change, I can successfully use a little PAM module that authenticates with an LDAP server and then creates, grants and revokes roles as necessary based on LDAP groups. It's not as good as Stephen's grand plan, but it's dead simple. Without the change, it still works, but the first login attempt for a given user fails. I can live with that in the back branches. > I don't entirely buy your argument that background workers > don't need up-to-date info for this. Yeah, a hypothetical background worker that starts up and then waits to be told the name of the role to use could suffer from this problem. -- Thomas Munro http://www.enterprisedb.com
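For illustration, the kind of thing such an authenticator can now do over its side connection before the login completes (role and group names invented); the change ensures the negative syscache entry left by the earlier failed lookup of the not-yet-existing role is discarded before InitializeSessionUserId looks the name up again.

    -- set up once: a group role mirroring an LDAP group
    CREATE ROLE ldap_readers NOLOGIN;
    -- run by the authenticator when "fred" first logs in
    CREATE ROLE fred LOGIN;
    GRANT ldap_readers TO fred;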
RE: How to make partitioning scale better for larger numbers of partitions
>I wondered if you compared to PG10 or to inheritence-partitioning (parent with >relkind='r' and either trigger or rule or >INSERT/UPDATE directly into child) ? Thank you for your reply. I compared it to PG11beta2 with a non-partitioned table. The non-partitioned table has 1100 records in one table. The partitioned table has one record in each leaf partition. Regards, -Original Message- From: Justin Pryzby [mailto:pry...@telsasoft.com] Sent: Friday, July 13, 2018 12:11 PM To: Kato, Sho/加藤 翔 Subject: Re: How to make partitioning scale better for larger numbers of partitions Hi, On Fri, Jul 13, 2018 at 02:58:53AM +, Kato, Sho wrote: > I benchmarked on a RANGE partitioned table with 1.1k leaf partitions and no > sub-partitioned tables. > But, statement latencies on a partitioned table is much slower than on a > non-partitioned table. > > UPDATE latency is 210 times slower than a non-partitioned table. > SELECT latency is 36 times slower than a non-partitioned table. > Surprisingly INSERT latency is almost same. I wondered if you compared to PG10 or to inheritence-partitioning (parent with relkind='r' and either trigger or rule or INSERT/UPDATE directly into child) ? Justin
Re: Cannot dump foreign key constraints on partitioned table
On 2018-Jul-13, Michael Paquier wrote: > On Thu, Jul 12, 2018 at 02:45:37PM -0400, Alvaro Herrera wrote: > > Thanks, looks good. I propose to add following pg_dump test to ensure > > this stays fixed. > > Thanks for adding the test. I was looking at a good way to add a test > but could not come up with something which can be summed up with one > query for create_sql, so what you have here is nice. Could you add an > extra test with a partition of dump_test.test_table_fk? Children should > have the FK defined as well with relhastriggers set to true, still when > I tested if the partitioned was not scanned for its FK, then its > children partition also missed it. So I think that it is important to > check that the FK is defined for all members of the partition tree. Hmm. The pg_dump tests make it easy to create a partition (in fact I had already modified the test to add one after submitting): diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl index 8860928df1..666760c0d8 100644 --- a/src/bin/pg_dump/t/002_pg_dump.pl +++ b/src/bin/pg_dump/t/002_pg_dump.pl @@ -635,7 +635,10 @@ my %tests = ( create_order => 4, create_sql => 'CREATE TABLE dump_test.test_table_fk ( col1 int references dump_test.test_table) - PARTITION BY RANGE (col1);', + PARTITION BY RANGE (col1); + CREATE TABLE dump_test.test_table_fk_1 + PARTITION OF dump_test.test_table_fk + FOR VALUES FROM (0) TO (10);', regexp => qr/ \QADD CONSTRAINT test_table_fk_col1_fkey FOREIGN KEY (col1) REFERENCES dump_test.test_table\E /xm, I'm not sure what to *do* with the partition, though :-) I don't think there's a nice way to verify that the FK actually exists, or that catalog rows are set in such-and-such way, after restoring this. The pg_restore tests are marked TODO in the suite. I think that'll have to wait. > I am fine to add the test myself and to push if you need help. Of > course feel free to do it yourself if you want. Either way is fine for > me. No worries -- I'll push it tomorrow morning. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
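Until pg_restore tests exist, a hedged way to check by hand that the FK comes back for both the partitioned table and its partition after a dump/restore cycle (object names as in the test above):

    SELECT conrelid::regclass AS table_name, conname
    FROM pg_constraint
    WHERE contype = 'f'
      AND conrelid IN ('dump_test.test_table_fk'::regclass,
                       'dump_test.test_table_fk_1'::regclass);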
Re: [PATCH] Timestamp for a XLOG_BACKUP_END WAL-record
On 10.07.2018 22:26, Fujii Masao wrote:
On Tue, Jul 10, 2018 at 6:41 PM, Andrey V. Lepikhov wrote:
On 10.07.2018 06:45, Andres Freund wrote:
Hi,
On 2018-07-10 06:41:32 +0500, Andrey V. Lepikhov wrote:
This functionality is needed in practice when we have to determine a recovery time of specific backup.
What do you mean by "recovery time of specific backup"?
recovery time - is a time point where backup of PostgreSQL database instance was made. Performing database recovery, we want to know what point in time the restored database will correspond to. This functionality refers to improving the usability of pg_basebackup and pg_probackup utilities.
Why don't you use a backup history file for that purpose?
The timestamp in a backup history file does not correspond to any WAL record and can't be tied to the time of the backup exactly. In my opinion, keeping the timestamp in XLOG_BACKUP_END is a more reliable, safe and easy way for recovering a database to a specific time.
Regards, -- Andrey Lepikhov Postgres Professional: https://postgrespro.com The Russian Postgres Company
Re: Concurrency bug in UPDATE of partition-key
On Thu, Jul 12, 2018 at 10:14 PM, Andres Freund wrote: > On 2018-07-11 09:16:33 -0400, Alvaro Herrera wrote: >> On 2018-Jul-11, Amit Kapila wrote: >> >> > Attached, please find an updated patch based on comments by Alvaro. >> > See, if this looks okay to you guys. >> >> LGTM as far as my previous comments are concerned. > > I see Amit pushed a patch here yesterday > https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=40ca70ebcc9d0bc1c02937b27c44b2766617e790 > , is there a need to keep the open item open, > No, I have moved the item from the open issues list. I was waiting for the build farm; yesterday it showed some failures after this commit, but they turned out to be unrelated random failures. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
How to make partitioning scale better for larger numbers of partitions
Hi,

I benchmarked on a RANGE partitioned table with 1.1k leaf partitions and no sub-partitioned tables. But statement latencies on a partitioned table are much slower than on a non-partitioned table.

UPDATE latency is 210 times slower than a non-partitioned table.
SELECT latency is 36 times slower than a non-partitioned table.
Surprisingly, INSERT latency is almost the same.

Of course, I'm sure table partitioning works well with up to a hundred partitions, as written in the PostgreSQL documentation. But my customer will use a partitioned table with 1.1k leaf partitions. So, we need to improve performance. Any ideas?

The results of pgbench and perf are listed below.

pgbench results
---
transaction type: update.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 180 s
number of transactions actually processed: 648
latency average = 278.202 ms
tps = 3.594512 (including connections establishing)
tps = 3.594545 (excluding connections establishing)
statement latencies in milliseconds:
0.011 \set aid random(1, 1100 * 1)
0.004 \set delta random(-5000, 5000)
0.038 BEGIN;
277.005 UPDATE test.accounts SET abalance = abalance + :delta WHERE aid = :aid;
1.140 END;

transaction type: select.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 180 s
number of transactions actually processed: 19415
latency average = 9.281 ms
tps = 107.745068 (including connections establishing)
tps = 107.746067 (excluding connections establishing)
statement latencies in milliseconds:
0.800 \set aid random(1, 1100 * 1)
0.137 \set delta random(-5000, 5000)
1.351 BEGIN;
4.941 SELECT abalance FROM test.accounts WHERE aid = :aid;
2.052 END;

transaction type: insert.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 180 s
number of transactions actually processed: 31895
latency average = 5.654 ms
tps = 176.865541 (including connections establishing)
tps = 176.867086 (excluding connections establishing)
statement latencies in milliseconds:
2.083 \set aid random(1, 1100 * 1)
0.003 \set delta random(-5000, 5000)
0.029 BEGIN;
3.222 INSERT INTO test.accounts_history (aid, delta, mtime) VALUES (:aid, :delta, CURRENT_TIMESTAMP);
0.317 END;

Top 10 of perf report

UPDATE:
21.33% postgres postgres [.] range_table_mutator
12.57% postgres postgres [.] AllocSetAlloc
4.97% postgres postgres [.] palloc
4.48% postgres postgres [.] make_one_rel
3.96% postgres postgres [.] lappend
2.74% postgres [kernel.kallsyms] [k] get_page_from_freelist
1.87% postgres postgres [.] setup_append_rel_array
1.68% postgres [kernel.kallsyms] [k] list_del
1.64% postgres [kernel.kallsyms] [k] __alloc_pages_nodemask
1.62% postgres [kernel.kallsyms] [k] unmap_vmas

SELECT:
14.72% postgres postgres [.] AllocSetAlloc
5.14% postgres postgres [.] hash_search_with_hash_value
4.23% postgres postgres [.] palloc
4.06% postgres postgres [.] MemoryContextAllocZeroAligned
2.61% postgres postgres [.] copyObjectImpl
2.34% postgres postgres [.] expression_tree_mutator
2.13% postgres [kernel.kallsyms] [k] _spin_lock
1.91% postgres postgres [.] lappend
1.59% postgres [kernel.kallsyms] [k] __link_path_walk
1.50% postgres postgres [.] set_rel_size

INSERT:
20.75% postgres postgres [.] hash_search_with_hash_value
6.03% postgres postgres [.] hash_any
4.88% postgres postgres [.] AllocSetAlloc
4.05% postgres postgres [.] LWLockRelease
4.05% postgres postgres [.] LWLockAcquire
3.27% postgres postgres [.] oid_cmp
3.06% postgres postgres [.] SearchCatCache
2.97% postgres postgres [.] LockReleaseAll
2.57% postgres postgres [.] pg_qsort
2.37% postgres postgres [.] hash_seq_search

The following is information on the environment used for the benchmark.

Server spec
---
Server has 16 cpu. Memory size is 264GB. Database directory is on SSD.

database tuning
---
shared_buffers = 102GB
max_locks_per_transaction = 100

postgresql version
--
11beta2 + patch1 + patch2
patch1: Allow direct lookups of AppendRelInfo by child relid commit 7d872c91a3f9d49b56117557cdbb0c3d4c620687
patch2: 0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch https://commitfest.postgresql.org/18/1690/

table definition
create table test.accounts(aid serial, abalance int) partition by range(aid);
create table test.accounts_history(id serial, aid int, delta int, mtime timestamp without time zo
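The table definition above is cut off; for anyone trying to reproduce the setup, a rough sketch of how a comparable set of 1,100 range partitions could be generated (partition names and bounds are guesses, not the original DDL):

    DO $$
    BEGIN
      FOR i IN 0..1099 LOOP
        EXECUTE format(
          'CREATE TABLE test.accounts_%s PARTITION OF test.accounts
             FOR VALUES FROM (%s) TO (%s)',
          i, i * 10000 + 1, (i + 1) * 10000 + 1);
      END LOOP;
    END $$;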
Re: requested timeline ... does not contain minimum recovery point ...
On 2018-07-12 19:22:50 -0700, Christophe Pettus wrote: > > > On Jul 12, 2018, at 17:52, Michael Paquier wrote: > > Wild guess: you did not issue a checkpoint on the promoted standby > > before running pg_rewind. > > I don't believe a manual checkpoint was done on the target (promoted standby, > new master), but it did one as usual during startup after the timeline switch: > > > 2018-07-10 19:28:38 UTC [5068]: [1-1] user=,db=,app=,client= LOG: > > checkpoint starting: force > > > The pg_rewind was started about 90 seconds later. Note that that message doesn't indicate a completed checkpoint, just that one started. Do you see a "checkpoint complete: wrote ..." message before the rewind started? Greetings, Andres Freund
RE: ON CONFLICT DO NOTHING on pg_dump
>I noticed one more thing: pg_dumpall.c doesn't really need to prohibit >--on-conflict-do-nothing without --insert. Its existing validation rejects >illegal >combinations of the settings that are *not* passed on to pg_dump. It seems OK >to >just pass those on and let pg_dump complain. For example, if you say >"pg_dumpall >--data-only --schema-only", it's pg_dump that complains, not pg_dumpall. I >think we >should do the same thing here. Thank you for the clarification. I didn't give thought to pg_dumpall internally running pg_dump. >Pushed, with those changes. Thanks!
Re: requested timeline ... does not contain minimum recovery point ...
> On Jul 12, 2018, at 19:22, Christophe Pettus wrote: > > >> On Jul 12, 2018, at 17:52, Michael Paquier wrote: >> Wild guess: you did not issue a checkpoint on the promoted standby >> before running pg_rewind. > > I don't believe a manual checkpoint was done on the target (promoted standby, > new master), but it did one as usual during startup after the timeline switch: > >> 2018-07-10 19:28:38 UTC [5068]: [1-1] user=,db=,app=,client= LOG: >> checkpoint starting: force > > The pg_rewind was started about 90 seconds later. That being said, the pg_rewind output seems to indicate that the old divergence point was still being picked up, rather than the one on timeline 104: > servers diverged at WAL position A58/5000 on timeline 103 > rewinding from last common checkpoint at A58/4E0689F0 on timeline 103 -- -- Christophe Pettus x...@thebuild.com
Re: pg_create_logical_replication_slot returns text instead of name
On Fri, Jul 13, 2018 at 11:22 AM, Michael Paquier wrote: > On Fri, Jul 13, 2018 at 11:14:01AM +0900, Masahiko Sawada wrote: >> Hmm, I'm also not sure about the policy of usage of name data type for >> columns that show an object identifier on external servers. There is a >> similar case; we have the pubname in pg_subscritpion as name type >> whereas the subpublications in pg_subscription is text[] type. > > Let's not bother then. In the case of the function you pointed out the > intention behind the code was clear anyway, so that's better now. Yeah, I agreed with you. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: requested timeline ... does not contain minimum recovery point ...
> On Jul 12, 2018, at 17:52, Michael Paquier wrote: > Wild guess: you did not issue a checkpoint on the promoted standby > before running pg_rewind. I don't believe a manual checkpoint was done on the target (promoted standby, new master), but it did one as usual during startup after the timeline switch: > 2018-07-10 19:28:38 UTC [5068]: [1-1] user=,db=,app=,client= LOG: checkpoint > starting: force The pg_rewind was started about 90 seconds later. -- -- Christophe Pettus x...@thebuild.com
Re: pg_create_logical_replication_slot returns text instead of name
On Fri, Jul 13, 2018 at 11:14:01AM +0900, Masahiko Sawada wrote: > Hmm, I'm also not sure about the policy of usage of name data type for > columns that show an object identifier on external servers. There is a > similar case; we have the pubname in pg_subscritpion as name type > whereas the subpublications in pg_subscription is text[] type. Let's not bother then. In the case of the function you pointed out the intention behind the code was clear anyway, so that's better now. -- Michael signature.asc Description: PGP signature
RE: Locking B-tree leafs immediately in exclusive mode
On Mon, July 9, 2018 at 5:25 PM, Simon Riggs wrote: > Please can you check insertion with the index on 2 keys > 1st key has 10,000 values > 2nd key has monotonically increasing value from last 1st key value > > So each session picks one 1st key value > Then each new INSERTion is a higher value of 2nd key > so 1,1, then 1,2 then 1,3 etc > > Probably easier to do this with a table like this > > CREATE UNLOGGED TABLE ordered2 (id integer, logdate timestamp default > now(), value text not null, primary key (id, logdate)); > > # script_ordered2.sql > \set i random(1, 1) > INSERT INTO ordered2 (id, value) VALUES (:i, 'abcdefghijklmnoprsqtuvwxyz'); > > Thanks I tried to do this, but I might have mistaken your intention, so please specify if I am wrong. While script_ordered.sql supposes that there is one contention point on the right-most leaf node, script_ordered2.sql supposes that there are some contention points on some leaf nodes, is that right? I experimented with key1 having 1 values, but there was no difference in the results compared to the unordered.sql one, so I experimented with key1 having 1, 2, 3, 5, 10, and 100 values. Also, if I created a primary key, "ERROR: duplicate key value violates unique constraint "ordered2_pkey"" happened, so I created a non-unique key.

#DDL
CREATE UNLOGGED TABLE ordered2 (id integer, logdate timestamp default now(), value text not null);
CREATE INDEX ordered2_key ON ordered2 (id, logdate);

# script_ordered2.sql
\set i random(1, 100) #second value is 1, 2, 3, 5, 10, or 100
INSERT INTO ordered2 (id, value) VALUES (:i, 'abcdefghijklmnoprsqtuvwxyz');

# ordered2 results, key1 having 1, 2, 3, 5, 10, and 100 values
master, key1 with 1 values: 236428
master, key1 with 2 values: 292248
master, key1 with 3 values: 340980
master, key1 with 5 values: 362808
master, key1 with 10 values: 379525
master, key1 with 100 values: 405265
patched, key1 with 1 values: 295862
patched, key1 with 2 values: 339538
patched, key1 with 3 values: 355793
patched, key1 with 5 values: 371332
patched, key1 with 10 values: 387731
patched, key1 with 100 values: 405115

From the attached graph ("some_contention_points_on_leaf_nodes.png"), as the contention points are dispersed, we can see that TPS increases and the TPS difference between master and the patched version becomes smaller. Yoshikazu Imai
Re: pg_create_logical_replication_slot returns text instead of name
On Fri, Jul 13, 2018 at 9:48 AM, Michael Paquier wrote: > On Thu, Jul 12, 2018 at 10:18:53PM +0900, Michael Paquier wrote: >> That's a small thing, but I agree with you. As far as I can see slot >> names are always mapped with the name type. I'll push that tomorrow if >> there are no objections. > > Pushed, with a catalog version bump. > Thank you! > While double-checking things, I have noticed that pg_stat_wal_receiver > also uses text for slot names. I am not sure if this one is worth > bothering as the docs point out the correct type, just mentioning. Hmm, I'm also not sure about the policy of usage of name data type for columns that show an object identifier on external servers. There is a similar case; we have the pubname in pg_subscription as name type whereas the subpublications in pg_subscription is text[] type. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: ON CONFLICT DO NOTHING on pg_dump
On Fri, Jul 13, 2018 at 12:33 PM, Ideriha, Takeshi wrote: >>+Add ON CONFLICT DO NOTHING clause in the INSERT commands. >> >>I think this would be better as: Add ON CONFLICT DO >>NOTHING to >>INSERT commands. > > Agreed. > >>+printf(_(" --on-conflict-do-nothing dump data as INSERT >>commands with ON CONFLICT DO NOTHING \n")); >> >>That's slightly misleading... let's just use the same wording again, eg "add >>ON >>CONFLICT DO NOTHING to INSERT commands". > > Agreed. But you forgot fixing it at pg_dump.c. > So could you please fix this and commit it? Right, thanks. I noticed one more thing: pg_dumpall.c doesn't really need to prohibit --on-conflict-do-nothing without --insert. Its existing validation rejects illegal combinations of the settings that are *not* passed on to pg_dump. It seems OK to just pass those on and let pg_dump complain. For example, if you say "pg_dumpall --data-only --schema-only", it's pg_dump that complains, not pg_dumpall. I think we should do the same thing here. Pushed, with those changes. Thanks for the patch and the reviews! -- Thomas Munro http://www.enterprisedb.com
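For reference, what the new option changes in the emitted archive, with an invented table and row; plain --inserts produces the first form, adding --on-conflict-do-nothing produces the second:

    INSERT INTO public.test_table VALUES (1);
    INSERT INTO public.test_table VALUES (1) ON CONFLICT DO NOTHING;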
Re: [HACKERS] WAL logging problem in 9.4.3?
On Thu, Jul 12, 2018 at 05:12:21PM +0300, Heikki Linnakangas wrote: > Doesn't have to be a trigger, could be a CHECK constraint, datatype input > function, etc. Admittedly, having a datatype input function that inserts to > the table is worth a "huh?", but I'm feeling very confident that we can > catch all such cases, and some of them might even be sensible. Sure, but do we want to be that invasive? Triggers are easy enough to block because those are available directly within cstate so you would know if those are triggered. CHECK constraint can be also easily looked after by looking at the Relation information, and actually as DEFAULT values could have an expression we'd want to block them, no? The input datatype is well, more tricky to deal with as there is no actual way to know if the INSERT is happening within the context of a COPY and this could be just C code. One way to tackle that would be to enforce the optimization to not be used if a non-system data type is used when doing COPY... Disabling entirely the optimization for any relation which has a CHECK constraint or DEFAULT expression basically applies to a hell lot of them, which makes the optimization, at least it seems to me, useless because it is never going to apply to most real-world cases. -- Michael signature.asc Description: PGP signature
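To make the concern concrete, a contrived sketch (names invented) of a DEFAULT expression that writes back into the table being loaded; this is the sort of case that a skip-WAL optimization keyed only on "no triggers" would miss:

    CREATE TABLE copy_target (a int, note text);

    CREATE FUNCTION side_effect_default() RETURNS text
    LANGUAGE plpgsql VOLATILE AS $$
    BEGIN
      -- writes into the very table the COPY is loading; note is supplied
      -- explicitly here, so this does not recurse
      INSERT INTO copy_target (a, note) VALUES (-1, 'side effect');
      RETURN 'filled in by default';
    END $$;

    ALTER TABLE copy_target ALTER COLUMN note SET DEFAULT side_effect_default();

    -- COPY copy_target (a) FROM ... now evaluates the default for every row,
    -- inserting extra rows mid-COPY.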
Re: patch to allow disable of WAL recycling
On Thu, Jul 12, 2018 at 10:52 PM, Tomas Vondra wrote: > I don't follow Alvaro's reasoning, TBH. There's a couple of things that > confuse me ... > > I don't quite see how reusing WAL segments actually protects against full > filesystem? On "traditional" filesystems I would not expect any difference > between "unlink+create" and reusing an existing file. On CoW filesystems > (like ZFS or btrfs) the space management works very differently and reusing > an existing file is unlikely to save anything. Yeah, I had the same thoughts. > But even if it reduces the likelihood of ENOSPC, it does not eliminate it > entirely. max_wal_size is not a hard limit, and the disk may be filled by > something else (when WAL is not on a separate device, when there is think > provisioning, etc.). So it's not a protection against data corruption we > could rely on. (And as was discussed in the recent fsync thread, ENOSPC is a > likely source of past data corruption issues on NFS and possibly other > filesystems.) Right. That ENOSPC discussion was about checkpointing though, not WAL. IIUC the hypothesis was that there may be stacks (possibly involving NFS or thin provisioning, or perhaps historical versions of certain local filesystems that had reservation accounting bugs, on a certain kernel) that could let you write() a buffer, and then later when the checkpointer calls fsync() the filesystem says ENOSPC, the kernel reports that and throws away the dirty page, and then at next checkpoint fsync() succeeds but the checkpoint is a lie and the data is smoke. We already PANIC on any errno except EINTR in XLogWriteLog(), as seen in Jerry's nearby stack trace, so that failure mode seems to be covered already for WAL, no? > AFAICS the original reason for reusing WAL segments was the belief that > overwriting an existing file is faster than writing a new file. That might > have been true in the past, but the question is if it's still true on > current filesystems. The results posted here suggest it's not true on ZFS, > at least. Yeah. The wal_recycle=on|off patch seems reasonable to me (modulo Andres's comments about the documentation; we should make sure that the 'off' setting isn't accidentally recommended to the wrong audience) and I vote we take it. Just by the way, if I'm not mistaken ZFS does avoid faulting when overwriting whole blocks, just like other filesystems: https://github.com/freebsd/freebsd/blob/master/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c#L1034 So then where are those faults coming from? Perhaps the tree page that holds the block pointer, of which there must be many when the recordsize is small? -- Thomas Munro http://www.enterprisedb.com
Re: Cannot dump foreign key constraints on partitioned table
On Thu, Jul 12, 2018 at 02:45:37PM -0400, Alvaro Herrera wrote: > Thanks, looks good. I propose to add following pg_dump test to ensure > this stays fixed. Thanks for adding the test. I was looking at a good way to add a test but could not come up with something which can be summed up with one query for create_sql, so what you have here is nice. Could you add an extra test with a partition of dump_test.test_table_fk? Children should have the FK defined as well with relhastriggers set to true, still when I tested if the partitioned was not scanned for its FK, then its children partition also missed it. So I think that it is important to check that the FK is defined for all members of the partition tree. I am fine to add the test myself and to push if you need help. Of course feel free to do it yourself if you want. Either way is fine for me. -- Michael signature.asc Description: PGP signature
Re: requested timeline ... does not contain minimum recovery point ...
On Thu, Jul 12, 2018 at 02:26:17PM -0700, Christophe Pettus wrote: > What surprises me about the error is that while the recovery point > seems reasonable, it shouldn't be on timeline 103, but on timeline > 105. Wild guess: you did not issue a checkpoint on the promoted standby before running pg_rewind. -- Michael signature.asc Description: PGP signature
Re: pg_create_logical_replication_slot returns text instead of name
On Thu, Jul 12, 2018 at 10:18:53PM +0900, Michael Paquier wrote: > That's a small thing, but I agree with you. As far as I can see slot > names are always mapped with the name type. I'll push that tomorrow if > there are no objections. Pushed, with a catalog version bump. While double-checking things, I have noticed that pg_stat_wal_receiver also uses text for slot names. I am not sure if this one is worth bothering as the docs point out the correct type, just mentioning. -- Michael signature.asc Description: PGP signature
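A couple of quick catalog queries to double-check the declared types being discussed (written against a build with the change; only system catalog columns are used):

    SELECT proname, pg_get_function_result(oid)
    FROM pg_proc
    WHERE proname IN ('pg_create_physical_replication_slot',
                      'pg_create_logical_replication_slot');

    SELECT atttypid::regtype
    FROM pg_attribute
    WHERE attrelid = 'pg_stat_wal_receiver'::regclass
      AND attname = 'slot_name';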
RE: ON CONFLICT DO NOTHING on pg_dump
Hi, thanks for the revision. > >+Add ON CONFLICT DO NOTHING clause in the INSERT commands. > >I think this would be better as: Add ON CONFLICT DO NOTHING >to >INSERT commands. Agreed. >+printf(_(" --on-conflict-do-nothing dump data as INSERT >commands with ON CONFLICT DO NOTHING \n")); > >That's slightly misleading... let's just use the same wording again, eg "add ON >CONFLICT DO NOTHING to INSERT commands". Agreed. But you forgot fixing it at pg_dump.c. So could you please fix this and commit it? Regards, Takeshi Ideriha
Re: Vacuum: allow usage of more than 1GB of work mem
On 07/12/2018 06:34 PM, Alvaro Herrera wrote: On 2018-Jul-12, Andrew Dunstan wrote: I fully understand. I think this needs to go back to "Waiting on Author". Why? Heikki's patch applies fine and passes the regression tests. Well, I understood Claudio was going to do some more work (see upthread). If we're going to go with Heikki's patch then do we need to change the author, or add him as an author? cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Vacuum: allow usage of more than 1GB of work mem
On 2018-Jul-12, Andrew Dunstan wrote: > I fully understand. I think this needs to go back to "Waiting on Author". Why? Heikki's patch applies fine and passes the regression tests. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Shared buffer access rule violations?
Asim R P writes:
> On Tue, Jul 10, 2018 at 8:33 PM, Tom Lane wrote:
>> Asim R P writes:
>>> One can find several PageInit() calls with no content lock held. See,
>>> for example:
>>> fill_seq_with_data()
>> That would be for a relation that no one else can even see yet, no?
> Yes, when the sequence is being created. No, when the sequence is
> being reset, in ResetSequence().

ResetSequence creates a new relfilenode, which no one else will be able to see until it commits, so the case is effectively the same as for creation.

>>> vm_readbuf()
>>> fsm_readbuf()
>> In these cases I'd imagine that the I/O completion interlock is what
>> is preventing other backends from accessing the buffer.
> What is I/O completion interlock?

Oh ... the RBM_ZERO_ON_ERROR action should be done under the I/O lock, but the ReadBuffer caller isn't holding that lock anymore, so I see your point here. Probably, nobody's noticed because it's a corner case that shouldn't happen under normal use, but it's not safe. I think what we want is more like

if (PageIsNew(BufferGetPage(buf)))
{
    LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
    if (PageIsNew(BufferGetPage(buf)))
        PageInit(BufferGetPage(buf), BLCKSZ, 0);
    UnlockReleaseBuffer(buf);
}

to ensure that the page is initialized once and only once, even if several backends do this concurrently.

regards, tom lane
Re: Vacuum: allow usage of more than 1GB of work mem
On 07/12/2018 12:38 PM, Claudio Freire wrote: On Thu, Jul 12, 2018 at 10:44 AM Andrew Dunstan wrote: On 04/06/2018 08:00 PM, Claudio Freire wrote: On Fri, Apr 6, 2018 at 5:25 PM, Claudio Freire wrote: On Fri, Apr 6, 2018 at 10:39 AM, Heikki Linnakangas wrote: On 06/04/18 01:59, Claudio Freire wrote: The iteration interface, however, seems quite specific for the use case of vacuumlazy, so it's not really a good abstraction. Can you elaborate? It does return the items one block at a time. Is that what you mean by being specific for vacuumlazy? I guess that's a bit special, but if you imagine some other users for this abstraction, it's probably not that unusual. For example, if we started using it in bitmap heap scans, a bitmap heap scan would also want to get the TIDs one block number at a time. But you're also tying the caller to the format of the buffer holding those TIDs, for instance. Why would you, when you can have an interface that just iterates TIDs and let the caller store them if/however they want? I do believe a pure iterator interface is a better interface. Between the b-tree or not discussion and the refactoring to separate the code, I don't think we'll get this in the next 24hs. So I guess we'll have ample time to poner on both issues during the next commit fest. There doesn't seem to have been much pondering done since then, at least publicly. Can we make some progress on this? It's been around for a long time now. Yeah, life has kept me busy and I haven't had much time to make progress here, but I was planning on doing the refactoring as we were discussing soon. Can't give a time frame for that, but "soonish". I fully understand. I think this needs to go back to "Waiting on Author". cheers andrew -- Andrew Dunstanhttps://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Re: [COMMITTERS] pgsql: Remove pgbench "progress" test pending solution of its timing is (fwd)
Indeed… but then throttling would not be tested:-) The point of the test is to exercise all time-related options, including throttling with a reasonable small value. Ok. I don't think that's really worthwhile. If we add some code that only runs in testing, then we're not really testing the real thing. I wouldn't trust the test to tell much. Let's just leave out that magic environment variable thing, and try to get the rest of the patch finished. If you remove the environment, then some checks need to be removed, because the 2 second run may be randomly shortened when there is nothing to do. If not, the test will fail non-deterministically, which is not acceptable. Hence the hack. I agree that it is not beautiful. The more reasonable alternative could be to always run for 2 seconds under -T 2, even if the execution could be shortened because there is nothing to do at all, i.e. remove the environment-based condition but keep the sleep. Yet another option would be to replace the env variable with an option, eg "--strict-time", that would be used probably only by the TAP test, but would be an available feature. -- Fabien.
Re: Temporary WAL segments files not cleaned up after an instance crash
On Thu, Jul 12, 2018 at 03:40:43PM +0300, Heikki Linnakangas wrote: > Sure. Thanks for the reviews, I have pushed the patch after moving the elog() call and changing the logs to mention "WAL segments" instead of "WAL files". -- Michael signature.asc Description: PGP signature
Re: requested timeline ... does not contain minimum recovery point ...
> On Jul 12, 2018, at 10:29, Andres Freund wrote: > > This needs a lot more information before somebody can reasonably act on > it. Happy to provide, of course! The two relevant hosts are "Ash" and "Chi". As mentioned, they've been flipped back and forth repeatedly using pg_rewind: One will be promoted, the other pg_rewind'd, and then brought up as a WAL shipping secondary to the first. This procedure has worked repeatedly, but this last instance failed. The logs from the failure are attached below, along with the output of from pg_controldata for both hosts. The systems are still in this configuration, so we can gather more as required. What surprises me about the error is that while the recovery point seems reasonable, it shouldn't be on timeline 103, but on timeline 105. Failover to Ash: 2018-07-10 19:14:22 UTC [2072]: [5153-1] user=,db=,app=,client= LOG: received promote request 2018-07-10 19:14:22 UTC [2072]: [5154-1] user=,db=,app=,client= LOG: redo done at A58/4F0822D0 2018-07-10 19:14:22 UTC [2072]: [5155-1] user=,db=,app=,client= LOG: last completed transaction was at log time 2018-07-10 19:13:54.515121+00 2018-07-10 19:14:23 UTC [2072]: [5156-1] user=,db=,app=,client= LOG: restored log file "00670A58004F" from archive 2018-07-10 19:14:23 UTC [2072]: [5157-1] user=,db=,app=,client= LOG: selected new timeline ID: 104 2018-07-10 19:14:24 UTC [2072]: [5158-1] user=,db=,app=,client= LOG: restored log file "0067.history" from archive [Chi rewound using pg_rewind against Ash, brought up] Chi: 2018-07-10 19:26:05 UTC [3260]: [4-1] user=,db=,app=,client= LOG: restored log file "0068.history" from archive 2018-07-10 19:26:06 UTC [3260]: [5-1] user=,db=,app=,client= LOG: redo starts at A58/4E061EF8 2018-07-10 19:26:07 UTC [3260]: [6-1] user=,db=,app=,client= LOG: restored log file "00680A580050" from archive 2018-07-10 19:26:08 UTC [3260]: [7-1] user=,db=,app=,client= LOG: restored log file "00680A580051" from archive 2018-07-10 19:26:09 UTC [3260]: [8-1] user=,db=,app=,client= LOG: restored log file "00680A580052" from archive 2018-07-10 19:26:10 UTC [3260]: [9-1] user=,db=,app=,client= LOG: restored log file "00680A580053" from archive 2018-07-10 19:26:11 UTC [3260]: [10-1] user=,db=,app=,client= LOG: restored log file "00680A580054" from archive 2018-07-10 19:26:12 UTC [3260]: [11-1] user=,db=,app=,client= LOG: restored log file "00680A580055" from archive 2018-07-10 19:26:13 UTC [3260]: [12-1] user=,db=,app=,client= LOG: restored log file "00680A580056" from archive 2018-07-10 19:26:13 UTC [3260]: [13-1] user=,db=,app=,client= LOG: restored log file "00680A580057" from archive 2018-07-10 19:26:14 UTC [3260]: [14-1] user=,db=,app=,client= LOG: restored log file "00680A580058" from archive 2018-07-10 19:26:15 UTC [3260]: [15-1] user=,db=,app=,client= LOG: restored log file "00680A580059" from archive 2018-07-10 19:26:15 UTC [3260]: [16-1] user=,db=,app=,client= LOG: restored log file "00680A58005A" from archive 2018-07-10 19:26:16 UTC [3260]: [17-1] user=,db=,app=,client= LOG: restored log file "00680A58005B" from archive 2018-07-10 19:26:17 UTC [3260]: [18-1] user=,db=,app=,client= LOG: restored log file "00680A58005C" from archive 2018-07-10 19:27:28 UTC [3260]: [19-1] user=,db=,app=,client= LOG: restored log file "00680A58005D" from archive 2018-07-10 19:27:28 UTC [3260]: [20-1] user=,db=,app=,client= LOG: consistent recovery state reached at A58/5D01BCA8 2018-07-10 19:27:28 UTC [2564]: [3-1] user=,db=,app=,client= LOG: database system is ready to accept read only connections 
2018-07-10 19:28:28 UTC [3260]: [21-1] user=,db=,app=,client= LOG: restored log file "00680A58005E" from archive Chi promoted: 2018-07-10 19:28:37 UTC [3260]: [22-1] user=,db=,app=,client= LOG: received promote request 2018-07-10 19:28:37 UTC [3260]: [23-1] user=,db=,app=,client= LOG: redo done at A58/5E0817D0 2018-07-10 19:28:37 UTC [3260]: [24-1] user=,db=,app=,client= LOG: last completed transaction was at log time 2018-07-10 19:28:12.954559+00 2018-07-10 19:28:37 UTC [3260]: [25-1] user=,db=,app=,client= LOG: restored log file "00680A58005E" from archive 2018-07-10 19:28:37 UTC [3260]: [26-1] user=,db=,app=,client= LOG: selected new timeline ID: 105 2018-07-10 19:28:38 UTC [3260]: [27-1] user=,db=,app=,client= LOG: restored log file "0068.history" from archive 2018-07-10 19:28:38 UTC [3260]: [28-1] user=,db=,app=,client= LOG: archive recovery complete 2018-07-10 19:28:38 UTC [3260]: [29-1] user=,db=,app=,client= LOG: MultiXact member wraparound protections are now enabled 2018-07-10 19:28:38 UTC [5068]: [1-1] user=,db=,app=,client= LOG: checkpoint starting: force 2018-07-10 19:28:38 UTC [2564]: [4-1] user=,db=,app=,client= LO
Re: When use prepared protocol, transaction will hold backend_xmin until the end of the transaction.
chenhj writes: > When execute sql with prepared protocol, read committed transaction will hold > backend_xmin until the end of the transaction. No, just till the active portal is dropped. In the case you show, the issue is that libpq doesn't bother to issue an explicit Close Portal message, but just lets the unnamed portal get recycled implicitly by the next query (cf. PQsendQueryGuts). So the portal stays open, and its snapshot stays alive, till some other command is sent. This is different from the behavior for simple query mode, where the portal is automatically closed after execution. I agree this isn't very desirable now that we have mechanisms to advance the advertised xmin as soon as snapshots go away. Perhaps portals could be taught to drop their snapshots as soon as the query has reached completion, but it'd be a little bit ticklish to not break valid use-patterns for cursors. Another idea would be to fix it on the client side by including an explicit Close command in the PQsendQuery sequence. But if there are similar usage patterns in other client libraries, it might take a long time to get them all up to speed. regards, tom lane
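One way to watch this from a second session (sketch): after the first session runs a query through the extended protocol and goes idle, its backend_xmin stays advertised until its next command recycles the unnamed portal.

    SELECT pid, state, backend_xmin
    FROM pg_stat_activity
    WHERE backend_xmin IS NOT NULL;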
Re: Failed assertion due to procedure created with SECURITY DEFINER option
> On Jul 4, 2018, at 3:43 AM, Peter Eisentraut > wrote: > > On 03.07.18 19:20, Andres Freund wrote: >> On 2018-06-29 10:19:17 -0700, Andres Freund wrote: >>> Hi, >>> >>> On 2018-06-29 13:56:12 +0200, Peter Eisentraut wrote: On 6/29/18 13:07, amul sul wrote: > This happens because of in fmgr_security_definer() function we are > changing global variable SecurityRestrictionContext and in the > StartTransaction() insisting it should be zero, which is the problem. Hmm, what is the reason for this insistation? >>> >>> Because it's supposed to be reset by AbortTransaction(), after an error. >> >> Does that make sense Peter? >> >> I've added this thread to the open items list. > > Proposed fix attached. First, reproduced the issue against HEAD and was able to successfully do so. Then, applied the patch and tested using original test case: testing=# CREATE PROCEDURE transaction_test1() LANGUAGE plpgsql SECURITY DEFINER testing-# AS $$ BEGIN testing$#COMMIT; testing$# END $$; CREATE PROCEDURE testing=# CALL transaction_test1(); 2018-07-12 15:45:49.846 EDT [39928] ERROR: invalid transaction termination 2018-07-12 15:45:49.846 EDT [39928] CONTEXT: PL/pgSQL function transaction_test1() line 2 at COMMIT 2018-07-12 15:45:49.846 EDT [39928] STATEMENT: CALL transaction_test1(); ERROR: invalid transaction termination CONTEXT: PL/pgSQL function transaction_test1() line 2 at COMMIT So it handles it as expected. Code and test cases look fine to me from what I can tell. My only suggestion would be if we could add some guidance to the documentation on what languages can support transaction control statements inside procedures with a SECURITY DEFINER. Jonathan
Re: [HACKERS] [PATCH] WIP Add ALWAYS DEFERRED option for constraints
Nico Williams writes: > Attached is an additional patch, as well as a new, rebased patch. > > This includes changes responsive to Álvaro Herrera's commentary about > the SET CONSTRAINTS manual page. This patch looks good to me. +1; Álvaro, please update the CF entry when you're also satisfied. Thanks, --Robbie
Re: [WIP] [B-Tree] Retail IndexTuple deletion
On Tue, Jul 3, 2018 at 5:17 AM, Andrey V. Lepikhov wrote: > Done. > Attachment contains an update for use v.2 of the 'Ensure nbtree leaf tuple > keys are always unique' patch. My v3 is still pending, but is now a lot better than v2. There were bugs in v2 that were fixed. One area that might be worth investigating is retail index tuple deletion performed within the executor in the event of non-HOT updates. Maybe LP_REDIRECT could be repurposed to mean "ghost record", at least in unique index tuples with no NULL values. The idea is that MVCC index scans can skip over those if they've already found a visible tuple with the same value. Also, when there was about to be a page split, they could be treated a little bit like LP_DEAD items. Of course, the ghost bit would have to be treated as a hint that could be "wrong" (e.g. because the transaction hasn't committed yet), so you'd have to go to the heap in the context of a page split, to double check. Also, you'd need heuristics that let you give up on this strategy when it didn't help. I think that this could work well enough for OLTP workloads, and might be more future-proof than doing it in VACUUM. Though, of course, it's still very complicated. -- Peter Geoghegan
Re: Cache invalidation after authentication (on-the-fly role creation)
Thomas Munro writes: > On Thu, Jul 5, 2018 at 9:35 AM, Tom Lane wrote: >> That seems like a *really* ad-hoc place to put it. Why should it be >> there, and not (say) somewhere inside InitializeSessionUserId, or maybe >> (also?) inside PerformAuthentication? Why do the existing call sites for >> AcceptInvalidationMessages --- in particular, the ones associated with >> lock acquisition --- not solve the problem already? > There is no lock acquisition involved here. The sequence is: > > 1. ClientAuthentication()->hba_getauthmethod()->check_hba()->get_role_oid() > tries to look up user "fred" and doesn't find it (that lookup is used > for group membership checks for hba line matching purposes, and not > finding it here is not fatal, assuming you match with "all"); the > cache now has a negative entry. > > 2. The configured authentication method runs, and via a side > connection it creates the role "fred". > > 3. InitializeSessionUserId() looks up "fred", and finds the stale > negative entry. [ thinks about that for awhile... ] So really this is an instance of a generic problem, which is that the result of a syscache lookup could be stale: it's not guaranteed to reflect changes that committed since the last AcceptInvalidationMessages call, which in the worst case is the start of the current transaction. We have workarounds in place that (mostly?) guarantee that relation-related lookups will get sufficiently up-to-date data, but there's nothing much protecting other catalogs such as pg_proc or pg_authid. I can't help thinking we're going to need to fix that eventually. But I can't think of any easy way to do so. Adding AcceptInvalidationMessages to SearchSysCache itself is right out for performance reasons, I'm afraid, unless we can somehow find a way to make the no-op path through it much cheaper than it is now. > I suppose the call to AcceptInvalidationMessages() could go at the end > of ClientAuthentication(). That'd be closer to the code that creates > the negative entry and immediately after the code that might modify > the catalogs. Or in PeformAuthentication(), its caller. Or in > IntializeSessionUserId() immediately before we try to look up the > name, but that's also called by background workers that don't need it. I think my preference would be to put it in InitializeSessionUserId so that it's clearly associated with the syscache lookups we're trying to protect. I don't entirely buy your argument that background workers don't need up-to-date info for this. regards, tom lane
Re: Cannot dump foreign key constraints on partitioned table
On 2018-Jul-12, Michael Paquier wrote: > Changing pg_class.relhastriggers is out of scope because as far as I > know partitioned tables have no triggers, so the current value is > correct, and that would be a catalog change at this stage which would > cause any existing deployments of v11 to complain about the > inconsistency. I think that this should be fixed client-side as the > attached does. Thanks, looks good. I propose to add following pg_dump test to ensure this stays fixed. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services >From 05b2d8912688ce6b1c613d7de8a0a3a874e21653 Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Thu, 12 Jul 2018 14:36:11 -0400 Subject: [PATCH] Dump foreign keys on partitioned tables MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit My patch that ended up as commit 3de241dba86f ("Foreign keys on partitioned tables") lacked pg_dump tests, so the pg_dump code that was there to support it inadvertently stopped working when I hacked the backend code to not emit pg_trigger rows for the partitioned table itself. Bug analysis and code fix is by Michaël. I (Álvaro) merely added the test. Reported-by: amul sul Co-authored-by: Michaël Paquier Co-authored-by: Álvaro Herrera Discussion: https://postgr.es/m/CAAJ_b94n=UsNVhgs97vCaWEZAMe-tGDRVuZ73oePQH=eajk...@mail.gmail.com --- src/bin/pg_dump/pg_dump.c| 7 ++- src/bin/pg_dump/t/002_pg_dump.pl | 18 ++ 2 files changed, 24 insertions(+), 1 deletion(-) diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c index 463639208d..74a1270169 100644 --- a/src/bin/pg_dump/pg_dump.c +++ b/src/bin/pg_dump/pg_dump.c @@ -7131,7 +7131,12 @@ getConstraints(Archive *fout, TableInfo tblinfo[], int numTables) { TableInfo *tbinfo = &tblinfo[i]; - if (!tbinfo->hastriggers || + /* +* For partitioned tables, foreign keys have no triggers so they +* must be included anyway in case some foreign keys are defined. +*/ + if ((!tbinfo->hastriggers && +tbinfo->relkind != RELKIND_PARTITIONED_TABLE) || !(tbinfo->dobj.dump & DUMP_COMPONENT_DEFINITION)) continue; diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl index 7eee870259..8860928df1 100644 --- a/src/bin/pg_dump/t/002_pg_dump.pl +++ b/src/bin/pg_dump/t/002_pg_dump.pl @@ -631,6 +631,24 @@ my %tests = ( }, }, + 'ALTER TABLE (partitioned) ADD CONSTRAINT ... FOREIGN KEY' => { + create_order => 4, + create_sql => 'CREATE TABLE dump_test.test_table_fk ( + col1 int references dump_test.test_table) + PARTITION BY RANGE (col1);', + regexp => qr/ + \QADD CONSTRAINT test_table_fk_col1_fkey FOREIGN KEY (col1) REFERENCES dump_test.test_table\E + /xm, + like => { + %full_runs, + %dump_test_schema_runs, + section_post_data => 1, + }, + unlike => { + exclude_dump_test_schema => 1, + }, + }, + 'ALTER TABLE ONLY test_table ALTER COLUMN col1 SET STATISTICS 90' => { create_order => 93, create_sql => -- 2.11.0
Re: assert in nested SQL procedure call in current HEAD
On 12.06.18 18:47, Andrew Gierth wrote: > While testing this, I ran into another semi-related issue: > shmem_exit_inprogress isn't ever being cleared in the postmaster, which > means that if you ever have a crash-restart, any attempt to do a > rollback in a procedure will then crash or get some other form of > corruption again every time until you manually restart the cluster. I have committed a fix for this. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Re: [COMMITTERS] pgsql: Remove pgbench "progress" test pending solution of its timing is (fwd)
I don't understand the 0.5 second rule. For the tests, we only need to ensure that at least one progress report is printed, right? [...] I still don't understand. Let's look at the code: if (progress && thread->tid == 0) { ... if (last_report == thread_start || now - last_report >= 500000) doProgressReport(thread, &last, &last_report, now, thread_start); For the testing, we just need to make sure that at least one progress report is always printed, if -P is used. Right? Yep. That is the first condition above: last_report is set to thread_start, meaning that there has been no report. So where does the 0.5 second rule come in? Can't we just do "if (no progress reports were printed) { print progress report; }" at the end? The second, 0.5 s condition is to print a closing report if significant time has passed since the last one, but we do not want to print a report for the same second twice. pgbench -T 5 -P 2 would then print reports at 2, 4 and 5. The 0.5 threshold ensures that we are not going to print "2 4.0[00] 4.0[01]" on -t whatever -P 2, which would look silly. If you do not like it, then the second condition can be removed, fine with me. It also adds a small feature, which is that there is always a final progress report when the run is completed; this can be useful when computing progress statistics, as otherwise some transactions would not be reported in any progress report at all. Any transactions in the last 0.5 seconds might still not get reported in any progress reports. Yep. I wanted a reasonable threshold, given that both -T and -P are in seconds anyway, so it probably could only happen with -t. Indeed… but then throttling would not be tested :-) The point of the test is to exercise all time-related options, including throttling with a reasonably small value. Ok. I don't think that's really worthwhile. If we add some code that only runs in testing, then we're not really testing the real thing. I wouldn't trust the test to tell much. Let's just leave out that magic environment variable thing, and try to get the rest of the patch finished. If you remove the environment variable, then some checks need to be removed, because the 2 second run may be randomly shortened when there is nothing to do. If not, the test will fail non-deterministically, which is not acceptable. Hence the hack. I agree that it is not beautiful. A more reasonable alternative could be to always last 2 seconds under -T 2, even if the execution could be shortened because there is nothing to do at all, i.e. remove the environment-based condition but keep the sleep. -- Fabien.
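To make the rule being discussed concrete, here is a tiny standalone simulation (illustration only, not pgbench code) of the closing-report behaviour as described: one report every P seconds, plus a final one at time T unless the previous report is less than 0.5 s old.

/*
 * Toy simulation of the closing-report rule discussed above.
 * Times are in seconds; this is not the pgbench implementation.
 */
#include <stdio.h>

static void
simulate(double T, double P)
{
	double	last = -1.0;		/* < 0 means "no report printed yet" */

	printf("-T %.0f -P %.0f:", T, P);
	for (double t = P; t <= T; t += P)
	{
		printf(" %.3f", t);
		last = t;
	}
	/* end of run: catch-up report, unless the last one is very recent */
	if (last < 0.0 || T - last >= 0.5)
		printf(" %.3f", T);
	printf("\n");
}

int
main(void)
{
	simulate(5.0, 2.0);			/* reports at 2, 4 and 5 */
	simulate(4.0, 2.0);			/* reports at 2 and 4 only: no near-duplicate 4.0xx */
	simulate(1.0, 2.0);			/* a single report at 1, so -P always prints something */
	return 0;
}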
Re: requested timeline ... does not contain minimum recovery point ...
Hi, On 2018-07-12 10:20:06 -0700, Christophe Pettus wrote: > PostgreSQL 9.6.9, Windows Server 2012 Datacenter (64-bit). > > We're trying to diagnose the error: > > requested timeline 105 does not contain minimum recovery point > A58/6B109F28 on timeline 103 > > The error occurs when a WAL-shipping (not streaming) secondary starts up. > > These two machines have been part of a stress-test where, repeatedly, the > secondary is promoted, the old primary is rewound using pg_rewind, and then > attached to the new primary. This has worked for multiple iterations, but > this error popped up. The last cycle was particularly fast: the new primary > was only up for about 10 seconds (although it had completed recovery) before > being shut down again, and pg_rewind applied to it to reconnect it with the > promoted secondary. This needs a lot more information before somebody can reasonably act on it. Greetings, Andres Freund
requested timeline ... does not contain minimum recovery point ...
PostgreSQL 9.6.9, Windows Server 2012 Datacenter (64-bit). We're trying to diagnose the error: requested timeline 105 does not contain minimum recovery point A58/6B109F28 on timeline 103 The error occurs when a WAL-shipping (not streaming) secondary starts up. These two machines have been part of a stress-test where, repeatedly, the secondary is promoted, the old primary is rewound using pg_rewind, and then attached to the new primary. This has worked for multiple iterations, but this error popped up. The last cycle was particularly fast: the new primary was only up for about 10 seconds (although it had completed recovery) before being shut down again, and pg_rewind applied to it to reconnect it with the promoted secondary. -- -- Christophe Pettus x...@thebuild.com
Re: Cannot dump foreign key constraints on partitioned table
On 2018-Jul-12, Michael Paquier wrote: > On Wed, Jul 11, 2018 at 03:49:59PM +0530, amul sul wrote: > > On the master head, getConstraints() function skips FK constraints for > > a partitioned table because of tbinfo->hastriggers is false. > > > > While creating FK constraints on the partitioned table, the FK triggers are > > only > > created on leaf partitions and skipped for the partitioned tables. > > Oops. Good catch. Looking at it now. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: GiST VACUUM
> 12 июля 2018 г., в 20:40, Heikki Linnakangas написал(а): > > On 12/07/18 19:06, Andrey Borodin wrote: >>> 11 июля 2018 г., в 0:07, Heikki Linnakangas >>> написал(а): >>> This seems misplaced. This code deals with internal pages, and as >>> far as I can see, this patch never marks internal pages as deleted, >>> only leaf pages. However, we should have something like this in the >>> leaf-page branch, to deal with the case that an insertion lands on >>> a page that was concurrently deleted. Did you have any tests, where >>> an insertion runs concurrently with vacuum, that would exercise >>> this? >> That bug could manifest only in case of crash between removing >> downlinks and marking pages deleted. > > Hmm. The downlink is removed first, so I don't think you can see that > situation after a crash. After a crash, you might have some empty, orphaned, > pages that have already been unlinked from the parent, but a search/insert > should never encounter them. > > Actually, now that I think about it more, I'm not happy with leaving orphaned > pages like that behind. Let's WAL-log the removal of the downlink, and > marking the leaf pages as deleted, in one WAL record, to avoid that. OK, will do this. But this will complicate WAL replay seriously, and I do not know a proper way to test that (BTW there is GiST amcheck in progress, but I decided to leave it for a while). > > But the situation in gistdoinsert(), where you encounter a deleted leaf page, > could happen during normal operation, if vacuum runs concurrently with an > insert. Insertion locks only one page at a time, as it descends the tree, so > after it has released the lock on the parent, but before it has locked the > child, vacuum might have deleted the page. In the latest patch, you're > checking for that just before swapping the shared lock for an exclusive one, > but I think that's wrong; you need to check for that after swapping the lock, > because otherwise vacuum might delete the page while you're not holding the > lock. Looks like a valid concern, I'll move that code again. > >> I do not know how to test this >> reliably. Internal pages are locked before leafs and locks are >> coupled. No cuncurrent backend can see downlinks to pages being >> deleted, unless crash happens. > > Are you sure? At a quick glance, I don't think the locks are coupled. Sorry for overquoting + /* rescan inner pages that had empty child pages */ + foreach(cell,rescanList) + { + Buffer buffer; + Pagepage; + OffsetNumber i, + maxoff; + IndexTuple idxtuple; + ItemId iid; + OffsetNumber todelete[MaxOffsetNumber]; + Buffer buftodelete[MaxOffsetNumber]; + int ntodelete = 0; + + buffer = ReadBufferExtended(rel, MAIN_FORKNUM, (BlockNumber)lfirst_int(cell), + RBM_NORMAL, info->strategy); + LockBuffer(buffer, GIST_EXCLUSIVE); Here's first lock + gistcheckpage(rel, buffer); + page = (Page) BufferGetPage(buffer); + + Assert(!GistPageIsLeaf(page)); + + maxoff = PageGetMaxOffsetNumber(page); + + for (i = OffsetNumberNext(FirstOffsetNumber); i <= maxoff; i = OffsetNumberNext(i)) + { + Buffer leafBuffer; + PageleafPage; + + iid = PageGetItemId(page, i); + idxtuple = (IndexTuple) PageGetItem(page, iid); + + leafBuffer = ReadBufferExtended(rel, MAIN_FORKNUM, ItemPointerGetBlockNumber(&(idxtuple->t_tid)), + RBM_NORMAL, info->strategy); + LockBuffer(leafBuffer, GIST_EXCLUSIVE); now locks are coupled in a top-down descent > > We do need some way of testing this.. Can we test replication of concurrent VACUUM and inserts in existing test suit? I just do not know. 
I can do these tests manually if that is enough. Best regards, Andrey Borodin.
Re: Internal error XX000 with enable_partition_pruning=on, pg 11 beta1 on Debian
On 2018-Jul-12, Tom Lane wrote: > Andres Freund writes: > > On 2018-06-29 18:17:08 -0400, Tom Lane wrote: > >> I'm on vacation and won't have time to look at this until week after > >> next. If you don't mind putting the topic on hold that long, I'll > >> be happy to take responsibility for it. > > > Is that still the plan? Do you forsee any issues here? This has been > > somewhat of a longstanding open item... > > It's on my to-do list, but I'm still catching up vacation backlog. > Is this item blocking anyone? I don't think so. The patch might conflict with other fixes, but I suppose that's a fact of life. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: small doc fix - using expressions in plpgsql FETCH command
2018-07-12 18:29 GMT+02:00 Tom Lane : > Pavel Stehule writes: > > PLpgSQL FETCH documentation is has ref on SQL FETCH command. SQL FETCH > > allows only int constants as count. PLpgSQL allows any expressions. In > this > > case documentation is not clear, and people can be messy - and apply SQL > > FETCH limits on PLpgSQL FETCH. > > Right. Pushed with some rewriting. > Thank you Pavel > regards, tom lane >
Re: Fix error message when trying to alter statistics on included column
Hi Alexander, Teodor, On 2018-06-28 18:28:03 +0900, Yugo Nagata wrote: > According to the error message, it is not allowed to alter statistics on > included column because this is "non-expression column". > > postgres=# create table test (i int, d int); > CREATE TABLE > postgres=# create index idx on test(i) include (d); > CREATE INDEX > postgres=# alter index idx alter column 2 set statistics 10; > ERROR: cannot alter statistics on non-expression column "d" of index "idx" > HINT: Alter statistics on table column instead. Is either of you going to take care of this one? IIRC Teodor committed the underlying patch, and Alexander wrote parts of it? Greetings, Andres Freund
Re: patch to allow disable of WAL recycling
I was asked to perform two different tests: 1) A benchmarksql run with WAL recycling on and then off, for comparison 2) A test when the filesystem fills up For #1, I did two 15 minute benchmarksql runs and here are the results. wal_recycle=on -- Term-00, Running Average tpmTOTAL: 299.84Current tpmTOTAL: 29412 Memory U14:49:02,470 [Thread-1] INFO jTPCC : Term-00, 14:49:02,470 [Thread-1] INFO jTPCC : Term-00, 14:49:02,471 [Thread-1] INFO jTPCC : Term-00, Measured tpmC (NewOrders) = 136.49 14:49:02,471 [Thread-1] INFO jTPCC : Term-00, Measured tpmTOTAL = 299.78 14:49:02,471 [Thread-1] INFO jTPCC : Term-00, Session Start = 2018-07-12 14:34:02 14:49:02,471 [Thread-1] INFO jTPCC : Term-00, Session End = 2018-07-12 14:49:02 14:49:02,471 [Thread-1] INFO jTPCC : Term-00, Transaction Count = 4497 wal_recycle=off --- Term-00, Running Average tpmTOTAL: 299.85Current tpmTOTAL: 29520 Memory U15:10:15,712 [Thread-1] INFO jTPCC : Term-00, 15:10:15,712 [Thread-1] INFO jTPCC : Term-00, 15:10:15,713 [Thread-1] INFO jTPCC : Term-00, Measured tpmC (NewOrders) = 135.89 15:10:15,713 [Thread-1] INFO jTPCC : Term-00, Measured tpmTOTAL = 299.79 15:10:15,713 [Thread-1] INFO jTPCC : Term-00, Session Start = 2018-07-12 14:55:15 15:10:15,713 [Thread-1] INFO jTPCC : Term-00, Session End = 2018-07-12 15:10:15 15:10:15,713 [Thread-1] INFO jTPCC : Term-00, Transaction Count = 4497 As can be seen, disabling WAL recycling does not cause any performance regression. For #2, I ran the test with WAL recycling on (the current behavior as well as the default with this patch) since the behavior of postgres is orthogonal to WAL recycling when the filesystem fills up. I capped the filesystem with 32MB of free space. I setup a configuration with wal_keep_segments=50 and started a long benchmarksql run. I had 4 WAL files already in existence when the run started. As the filesystem fills up, the performance of postgres gets slower and slower, as would be expected. This is due to the COW nature of the filesystem and the fact that all writes need to find space. When a new WAL file is created, this essentially consumes no space since it is a zero-filled file, so no filesystem space is consumed, except for a little metadata for the file. However, as writes occur to the WAL file, space is being consumed. Eventually all space in the filesystem is consumed. I could not tell if this occurred during a write to an existing WAL file or a write to the database itself. As other people have observed, WAL file creation in a COW filesystem is not the problematic operation when the filesystem fills up. It is the writes to existing files that will fail. When postgres core dumped there were 6 WAL files in the pg_wal directory (well short of the 50 configured). When the filesystem filled up, postgres core dumped and benchmarksql emitted a bunch of java debug information which I could provide if anyone is interested. Here is some information for the postgres core dump. It looks like postgres aborted itself, but since the filesystem is full, there is nothing in the log file. 
> ::status debugging core file of postgres (64-bit) from f6c22f98-38aa-eb51-80d2-811ed25bed6b file: /zones/f6c22f98-38aa-eb51-80d2-811ed25bed6b/local/pgsql/bin/postgres initial argv: /usr/local/pgsql/bin/postgres -D /home/postgres/data threading model: native threads status: process terminated by SIGABRT (Abort), pid=76019 uid=1001 code=-1 > $C f9dfa4b0 libc.so.1`_lwp_kill+0xa() f9dfa4e0 libc.so.1`raise+0x20(6) f9dfa530 libc.so.1`abort+0x98() f9dfa560 errfinish+0x230() f9dfa5e0 XLogWrite+0x294() f9dfa610 XLogBackgroundFlush+0x18d() f9dfaa50 WalWriterMain+0x1a8() f9dfaab0 AuxiliaryProcessMain+0x3ff() f9dfab40 0x7b5566() f9dfab90 reaper+0x60a() f9dfaba0 libc.so.1`__sighndlr+6() f9dfac30 libc.so.1`call_user_handler+0x1db(12, 0, f9dfaca0) f9dfac80 libc.so.1`sigacthandler+0x116(12, 0, f9dfaca0) f9dfb0f0 libc.so.1`__pollsys+0xa() f9dfb220 libc.so.1`pselect+0x26b(7, f9dfdad0, 0, 0, f9dfb230, 0) f9dfb270 libc.so.1`select+0x5a(7, f9dfdad0, 0, 0, f9dfb6c0) f9dffb00 ServerLoop+0x289() f9dffb70 PostmasterMain+0xcfa() f9dffba0 main+0x3cd() f9dffbd0 _start_crt+0x83() f9dffbe0 _start+0x18() Let me know if there is any other information I could provide. Thanks, Jerry On Tue, Jun 26, 2018 at 7:35 AM, Jerry Jelinek wrote: > Hello All, > > Attached is a patch to provide an option to disable WAL recycling. We have > found that this can help performance by eliminating read-modify-write > behavior on old WAL files that are no longer resident in the filesystem > cache. The is a lot more detail on the background of the motivation for > this in the following thread. > > https://www.postgresql.org/message-id/flat/CACukRjO7
Re: Concurrency bug in UPDATE of partition-key
On 2018-07-11 09:16:33 -0400, Alvaro Herrera wrote: > On 2018-Jul-11, Amit Kapila wrote: > > > Attached, please find an updated patch based on comments by Alvaro. > > See, if this looks okay to you guys. > > LGTM as far as my previous comments are concerned. I see Amit pushed a patch here yesterday https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=40ca70ebcc9d0bc1c02937b27c44b2766617e790 , is there a need to keep the open item open, or is this the complete fix? Greetings, Andres Freund
Re: Internal error XX000 with enable_partition_pruning=on, pg 11 beta1 on Debian
Andres Freund writes: > On 2018-06-29 18:17:08 -0400, Tom Lane wrote: >> I'm on vacation and won't have time to look at this until week after >> next. If you don't mind putting the topic on hold that long, I'll >> be happy to take responsibility for it. > Is that still the plan? Do you forsee any issues here? This has been > somewhat of a longstanding open item... It's on my to-do list, but I'm still catching up vacation backlog. Is this item blocking anyone? regards, tom lane
Re: assert in nested SQL procedure call in current HEAD
Hi, On 2018-06-29 13:52:23 +0200, Peter Eisentraut wrote: > From 95fc7156afe521b715fab08d44606774df875e92 Mon Sep 17 00:00:00 2001 > From: Peter Eisentraut > Date: Fri, 29 Jun 2018 13:28:39 +0200 > Subject: [PATCH] Fix assert in nested SQL procedure call Andrew, Peter, are you happy with this? It'd be good to close this open item. - Andres
Re: Internal error XX000 with enable_partition_pruning=on, pg 11 beta1 on Debian
Hi Tom, On 2018-06-29 18:17:08 -0400, Tom Lane wrote: > Alvaro Herrera writes: > > Since Tom has been revamping this code lately, I think it's a good > > idea to wait for his input. > > I'm on vacation and won't have time to look at this until week after > next. If you don't mind putting the topic on hold that long, I'll > be happy to take responsibility for it. Is that still the plan? Do you forsee any issues here? This has been somewhat of a longstanding open item... Greetings, Andres Freund
Re: GiST VACUUM
On 12/07/18 19:06, Andrey Borodin wrote: 11 июля 2018 г., в 0:07, Heikki Linnakangas написал(а): This seems misplaced. This code deals with internal pages, and as far as I can see, this patch never marks internal pages as deleted, only leaf pages. However, we should have something like this in the leaf-page branch, to deal with the case that an insertion lands on a page that was concurrently deleted. Did you have any tests, where an insertion runs concurrently with vacuum, that would exercise this? That bug could manifest only in case of crash between removing downlinks and marking pages deleted. Hmm. The downlink is removed first, so I don't think you can see that situation after a crash. After a crash, you might have some empty, orphaned, pages that have already been unlinked from the parent, but a search/insert should never encounter them. Actually, now that I think about it more, I'm not happy with leaving orphaned pages like that behind. Let's WAL-log the removal of the downlink, and marking the leaf pages as deleted, in one WAL record, to avoid that. But the situation in gistdoinsert(), where you encounter a deleted leaf page, could happen during normal operation, if vacuum runs concurrently with an insert. Insertion locks only one page at a time, as it descends the tree, so after it has released the lock on the parent, but before it has locked the child, vacuum might have deleted the page. In the latest patch, you're checking for that just before swapping the shared lock for an exclusive one, but I think that's wrong; you need to check for that after swapping the lock, because otherwise vacuum might delete the page while you're not holding the lock. I do not know how to test this reliably. Internal pages are locked before leafs and locks are coupled. No cuncurrent backend can see downlinks to pages being deleted, unless crash happens. Are you sure? At a quick glance, I don't think the locks are coupled. We do need some way of testing this.. - Heikki
Re: Vacuum: allow usage of more than 1GB of work mem
On Thu, Jul 12, 2018 at 10:44 AM Andrew Dunstan wrote: > > > > On 04/06/2018 08:00 PM, Claudio Freire wrote: > > On Fri, Apr 6, 2018 at 5:25 PM, Claudio Freire > > wrote: > >> On Fri, Apr 6, 2018 at 10:39 AM, Heikki Linnakangas > >> wrote: > >>> On 06/04/18 01:59, Claudio Freire wrote: > The iteration interface, however, seems quite specific for the use > case of vacuumlazy, so it's not really a good abstraction. > >>> > >>> Can you elaborate? It does return the items one block at a time. Is that > >>> what you mean by being specific for vacuumlazy? I guess that's a bit > >>> special, but if you imagine some other users for this abstraction, it's > >>> probably not that unusual. For example, if we started using it in bitmap > >>> heap scans, a bitmap heap scan would also want to get the TIDs one block > >>> number at a time. > >> But you're also tying the caller to the format of the buffer holding > >> those TIDs, for instance. Why would you, when you can have an > >> interface that just iterates TIDs and let the caller store them > >> if/however they want? > >> > >> I do believe a pure iterator interface is a better interface. > > Between the b-tree or not discussion and the refactoring to separate > > the code, I don't think we'll get this in the next 24hs. > > > > So I guess we'll have ample time to poner on both issues during the > > next commit fest. > > > > > > There doesn't seem to have been much pondering done since then, at least > publicly. Can we make some progress on this? It's been around for a long > time now. Yeah, life has kept me busy and I haven't had much time to make progress here, but I was planning on doing the refactoring as we were discussing soon. Can't give a time frame for that, but "soonish".
Re: small doc fix - using expressions in plpgsql FETCH command
Pavel Stehule writes: > PLpgSQL FETCH documentation is has ref on SQL FETCH command. SQL FETCH > allows only int constants as count. PLpgSQL allows any expressions. In this > case documentation is not clear, and people can be messy - and apply SQL > FETCH limits on PLpgSQL FETCH. Right. Pushed with some rewriting. regards, tom lane
Re: TRUNCATE tables referenced by FKs on partitioned tables
On 2018-Jul-12, Michael Paquier wrote: > On Wed, Jul 11, 2018 at 06:59:16PM -0400, Alvaro Herrera wrote: > > Anyway, this patch seems to fix it, and adds what I think is appropriate > > test coverage. > > This looks good to me. I am noticing that the documentation of TRUNCATE > does not mention that when running the command on a partitioned table > then it automatically gets to the children even if CASCADE is not used > and each child partition is not listed. Hmm ... well, that's not new -- I think that came in with pg10. > What is the filler column added in truncpart used for? Nothing. Also column b -- I had an additional different test, but then I discovered it wasn't testing anything new. Removed both. Pushed, thanks for looking! -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Re: [COMMITTERS] pgsql: Remove pgbench "progress" test pending solution of its timing is (fwd)
On 12/07/18 19:00, Fabien COELHO wrote: How pgbenchs prints a progress if none were printed, or if the last progress was over 0.5 seconds ago, so as to have kind of a catchup in the end. I don't understand the 0.5 second rule. For the tests, we only need to ensure that at least one progress report is printed, right? I'm not so sure;-) I do not want to trust the threadRun loop in case of heavy load or whatever issue a run may encounter, so this approach ensures that structurally there is always one output even of the whole loop went very wrong. I still don't understand. For the testing, we just need to make sure that at least one progress report is always printed, if -P is used. Right? So where does the 0.5 second rule come in? Can't we just do "if (no progress reports were printed) { print progress report; }" at the end? It also adds a small feature which is that there is always a final progress when the run is completed, which can be useful when computing progress statistics, otherwise some transactions could not be reported in any progress at all. Any transactions in the last 0.5 seconds might still not get reported in any progress reports. when there is nothing to do. This ensures that the -T 2 tap test runs for at least 2 seconds, whatever. If the host is overload it might be more, but it cannot be less unless something was wrong. If you want to write a test that checks that a two-second test takes at least two seconds, can't you just not use throttling in that test? Indeed… but then throttling would not be tested:-) The point of the test is to exercise all time-related options, including throttling with a reasonable small value. Ok. I don't think that's really worthwhile. If we add some code that only runs in testing, then we're not really testing the real thing. I wouldn't trust the test to tell much. Let's just leave out that magic environment variable thing, and try to get the rest of the patch finished. - Heikki
Re: [PATCH] Add missing type conversion functions for PL/Python
On 12/07/18 18:06, Nikita Glukhov wrote: I have added some cross-type test cases and now almost all new code is covered (excluding several error cases which can be triggered only by custom numeric type implementations). Thanks! Some of those new tests actually fail, if you run them against unpatched master. For example: SELECT * FROM test_type_conversion_float8_int2(100::float8); INFO: (100.0, ) - test_type_conversion_float8_int2 --- - 100 -(1 row) - +ERROR: invalid input syntax for integer: "100.0" +CONTEXT: while creating return value +PL/Python function "test_type_conversion_float8_int2" So this patch is making some subtle changes to behavior. I don't think we want that. - Heikki
Re: GiST VACUUM
Hi! PFA v5 of the patch series. > 11 июля 2018 г., в 0:07, Heikki Linnakangas написал(а): > > This seems misplaced. This code deals with internal pages, and as far as I > can see, this patch never marks internal pages as deleted, only leaf pages. > However, we should have something like this in the leaf-page branch, to deal > with the case that an insertion lands on a page that was concurrently > deleted. Did you have any tests, where an insertion runs concurrently with > vacuum, that would exercise this? That bug could manifest only in case of crash between removing downlinks and marking pages deleted. I do not know how to test this reliably. Internal pages are locked before leafs and locks are coupled. No cuncurrent backend can see downlinks to pages being deleted, unless crash happens. I've replaced code covering this situation into leaf code path and added comment. > > The code in gistbulkdelete() seems pretty expensive. In the first phase, it > records the parent of every empty leaf page it encounters. In the second > phase, it scans every leaf page of that parent, not only those leaves that > were seen as empty. It is fixed in second patch of the series. > > I'm a bit wary of using pd_prune_xid for the checks to determine if a deleted > page can be recycled yet. In heap pages, pd_prune_xid is just a hint, but > here it's used for a critical check. This seems to be the same mechanism we > use in B-trees, but in B-trees, we store the XID in BTPageOpaqueData.xact, > not pd_prune_xid. Also, in B-trees, we use ReadNewTransactionId() to set it, > not GetCurrentTransactionId(). See comments in _bt_unlink_halfdead_page() for > explanation. This patch is missing any comments to explain how this works in > GiST. I've replaced usage of GetCurrentTransactionId() with ReadNewTransactionId() and added explanation of what is going on. Also, I've added comments about that pd_prune_xid usage. There is no other use in GiST for this field. And there is no other room to place this xid on a page without pg_upgrade. > > If you crash in the middle of gistbulkdelete(), after it has removed the > downlink from the parent, but before it has marked the leaf page as deleted, > the leaf page is "leaked". I think that's acceptable, but a comment at least > would be good. Added explanatory comment between WAL-logging downlink removal and marking pages deleted. Thank you for reviewing the patch! Best regards, Andrey Borodin. 0002-Physical-GiST-scan-during-VACUUM-v5.patch Description: Binary data 0001-Delete-pages-during-GiST-VACUUM-v5.patch Description: Binary data
Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors
The point is to avoid building the message with dynamic allocation and so if in the end it is not used. Ok! About avoidance - I'm afraid there's one more piece of debugging code with the same problem: Indeed. I'd like to avoid all instances, so that PQExpBufferData is not needed anywhere, if possible. If not possible, then too bad, but I'd prefer to make do with formatted prints only, for simplicity. -- Fabien.
Re: [HACKERS] Re: [COMMITTERS] pgsql: Remove pgbench "progress" test pending solution of its timing is (fwd)
Hello Heikki, Thanks for having a look at this small patch which aim at improving pgbench coverage. How pgbenchs prints a progress if none were printed, or if the last progress was over 0.5 seconds ago, so as to have kind of a catchup in the end. I don't understand the 0.5 second rule. For the tests, we only need to ensure that at least one progress report is printed, right? I'm not so sure;-) I do not want to trust the threadRun loop in case of heavy load or whatever issue a run may encounter, so this approach ensures that structurally there is always one output even of the whole loop went very wrong. It also adds a small feature which is that there is always a final progress when the run is completed, which can be useful when computing progress statistics, otherwise some transactions could not be reported in any progress at all. Looking at the code as it exists, I think it already works like that, although it's by accident. Not sure though, and if we're going to rely on that, it makes sense to make it more explicit. Yep. Also, by default, when running under throttling for 2 seconds at 20 tps, there is a slight chance that no transactions are scheduled and that pgbench returns immediately, hence the added variable to ensure that it lasts something anyway, and that some minimal reporting is always done whatever. when there is nothing to do. This ensures that the -T 2 tap test runs for at least 2 seconds, whatever. If the host is overload it might be more, but it cannot be less unless something was wrong. If you want to write a test that checks that a two-second test takes at least two seconds, can't you just not use throttling in that test? Indeed… but then throttling would not be tested:-) The point of the test is to exercise all time-related options, including throttling with a reasonable small value. Attached is a rebase. -- Fabien.diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c index f0c5149523..26ac6fb37d 100644 --- a/src/bin/pgbench/pgbench.c +++ b/src/bin/pgbench/pgbench.c @@ -5597,6 +5597,96 @@ main(int argc, char **argv) return 0; } +/* display the progress report */ +static void +doProgressReport(TState *thread, StatsData *plast, int64 *plast_report, +int64 now, int64 thread_start) +{ + StatsData cur; + int64 run = now - *plast_report, + ntx; + double tps, + total_run, + latency, + sqlat, + lag, + stdev; + chartbuf[64]; + int i; + + /* +* Add up the statistics of all threads. +* +* XXX: No locking. There is no guarantee that we get an +* atomic snapshot of the transaction count and latencies, so +* these figures can well be off by a small amount. The +* progress report's purpose is to give a quick overview of +* how the test is going, so that shouldn't matter too much. +* (If a read from a 64-bit integer is not atomic, you might +* get a "torn" read and completely bogus latencies though!) 
+*/ + initStats(&cur, 0); + for (i = 0; i < nthreads; i++) + { + mergeSimpleStats(&cur.latency, &thread[i].stats.latency); + mergeSimpleStats(&cur.lag, &thread[i].stats.lag); + cur.cnt += thread[i].stats.cnt; + cur.skipped += thread[i].stats.skipped; + } + + /* we count only actually executed transactions */ + ntx = (cur.cnt - cur.skipped) - (plast->cnt - plast->skipped); + total_run = (now - thread_start) / 100.0; + tps = 100.0 * ntx / run; + if (ntx > 0) + { + latency = 0.001 * (cur.latency.sum - plast->latency.sum) / ntx; + sqlat = 1.0 * (cur.latency.sum2 - plast->latency.sum2) / ntx; + stdev = 0.001 * sqrt(sqlat - 100.0 * latency * latency); + lag = 0.001 * (cur.lag.sum - plast->lag.sum) / ntx; + } + else + { + latency = sqlat = stdev = lag = 0; + } + + if (progress_timestamp) + { + /* +* On some platforms the current system timestamp is +* available in now_time, but rather than get entangled +* with that, we just eat the cost of an extra syscall in +* all cases. +*/ + struct timeval tv; + + gettimeofday(&tv, NULL); + snprintf(tbuf, sizeof(tbuf), "%ld.%03ld s", +(long) tv.tv_sec, (long) (tv.tv_usec / 1000)); + } + else + { + /* round seconds are expected, but the thread may be late */ + snprintf(tbuf, sizeof(tbuf), "%.1f s", t
Re: generating bootstrap entries for array types
Hi John, John Naylor writes: >> On 4/26/18, Tom Lane wrote: >>> if I counted correctly. (Array entries should be ignored for this >>> purpose; maybe we'll autogenerate them someday.) >> >> Hmm, that wouldn't be too hard. Add a new metadata field called >> 'array_type_oid', then if it finds such an OID, teach genbki.pl to >> construct a tuple for the corresponding array type by consulting >> something like >> >> chartypcategoryBKI_ARRAY(A); >> chartypstorage BKI_ARRAY(x); >> ...etc >> >> in the header file, plus copying typalign from the element type. I'll >> whip this up sometime and add it to the next commitfest. > > This turned out to be slightly more complicated than that, but still > pretty straightforward. Doing this in genbki.pl makes DBD::Pg lose its array type information, since it uses Catalog::ParseData() to get it (https://github.com/bucardo/dbdpg/blob/master/types.c#L439). May I suggest moving gen_array_types() to Catalog.pm and calling it from ParseData() if the input file is pg_type.dat, so that it always returns complete data? Thanks, - ilmari -- "A disappointingly low fraction of the human race is, at any given time, on fire." - Stig Sandbeck Mathisen
Re: [PATCH] btree_gist: fix union implementation for variable length columns
It would be easier to figure this out if the btree_gist code weren't so desperately undocumented. Teodor, do you remember why it's like this? Will look. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: Failure assertion in GROUPS mode of window function in current HEAD
On Fri, Jul 13, 2018 at 12:17 AM, Tom Lane wrote: > Masahiko Sawada writes: >> I think we also can update the doc about that GROUPS offset mode >> requires ORDER BY clause. Thoughts? Attached patch updates it. > > Ooops, I forgot to check the docs. This isn't quite the right fix > though --- the problem is that there's a sentence at the end of that > para that now says exactly the wrong thing. I fixed that. > Yeah, that's not the right fix. Thank you for fixing! Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: _isnan() on Windows
On 07/12/2018 10:38 AM, Tom Lane wrote: Andrew Dunstan writes: On 07/12/2018 10:20 AM, Tom Lane wrote: bowerbird and hamerkop have some gripes like this: bowerbird | c:\perl64\lib\core\win32.h(218): warning C4005: 'isnan' : macro redefinition (src/pl/plperl/SPI.c) [G:\prog\bf\root\HEAD\pgsql.build\plperl.vcxproj] We actually undef a bunch of things in plperl.h to keep the compiler quiet. Maybe we need to add this to the list? Perhaps. But how do we tell the platforms where we should do that from the ones where we shouldn't? In the _MSC_VER section: #ifdef isnan #undef isnan #endif By inspection, the perl header is just defining it to _isnan, for every MSC version. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Failure assertion in GROUPS mode of window function in current HEAD
Masahiko Sawada writes: > I think we also can update the doc about that GROUPS offset mode > requires ORDER BY clause. Thoughts? Attached patch updates it. Ooops, I forgot to check the docs. This isn't quite the right fix though --- the problem is that there's a sentence at the end of that para that now says exactly the wrong thing. I fixed that. regards, tom lane
Re: Postgres 11 release notes
"Jonathan S. Katz" writes: > On Jun 30, 2018, at 5:52 PM, Tom Lane wrote: >> Mmm, yeah, I suppose it should say "all framing options" rather than >> implying that we've implemented every other window-related frammish >> there is in the spec. > +1. Attached patch that does exactly that. Pushed, thanks. regards, tom lane
Re: [PATCH] Add missing type conversion functions for PL/Python
On 11.07.2018 21:03, Heikki Linnakangas wrote: On 26/03/18 19:07, Nikita Glukhov wrote: Attached fixed 3th version of the patch: Thanks, I'm reviewing this now. Nice speedup! Thank you for your review. There is no test coverage for some of the added code. You can get a code coverage report with: ./configure --enable-coverage ... make make -C src/pl/plpython check make coverage-html That produces a code coverage report in coverage/index.html. Please look at the coverage of the new functions, and add tests to src/pl/plpython/sql/plpython_types.sql so that all the new code is covered. I have added some cross-type test cases and now almost all new code is covered (excluding several error cases which can be triggered only by custom numeric type implementations). In some places, where you've already checked the object type e.g. with PyFloat_Check(), I think you could use the less safe variants, like PyFloat_AS_DOUBLE() instead of PyFloat_AsDouble(). Since this patch is all about performance, seems worth doing. Fixed. Some of the conversions seem a bit questionable. For example: /* * Convert a Python object to a PostgreSQL float8 datum directly. * If can not convert it directly, fallback to PLyObject_ToScalar * to convert it. */ static Datum PLyObject_ToFloat8(PLyObToDatum *arg, PyObject *plrv, bool *isnull, bool inarray) { if (plrv == Py_None) { *isnull = true; return (Datum) 0; } if (PyFloat_Check(plrv) || PyInt_Check(plrv) || PyLong_Check(plrv)) { *isnull = false; return Float8GetDatum((double)PyFloat_AsDouble(plrv)); } return PLyObject_ToScalar(arg, plrv, isnull, inarray); } The conversion from Python int to C double is performed by PyFloat_AsDouble(). But there's no error checking. And wouldn't PyLong_AsDouble() be more appropriate in that case, since we already checked the python type? Yes, this might be wrong, but PyFloat_AsDouble() internally tries first to convert number to float. Also, after gaining more experience in PL/Python during the implementation of jsonb transforms, I found a lot of similar problems in the code. All of them are fixed in the 4th version of the patch. 
-- Nikita Glukhov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company diff --git a/src/pl/plpython/expected/plpython_types.out b/src/pl/plpython/expected/plpython_types.out index eda965a..dc1232b 100644 --- a/src/pl/plpython/expected/plpython_types.out +++ b/src/pl/plpython/expected/plpython_types.out @@ -134,6 +134,158 @@ INFO: (None, ) (1 row) +CREATE FUNCTION test_type_conversion_int4_int2(x int4) RETURNS int2 AS $$ +plpy.info(x, type(x)) +return x +$$ LANGUAGE plpythonu; +SELECT * FROM test_type_conversion_int4_int2(100::int4); +INFO: (100, ) + test_type_conversion_int4_int2 + +100 +(1 row) + +SELECT * FROM test_type_conversion_int4_int2(-100::int4); +INFO: (-100, ) + test_type_conversion_int4_int2 + + -100 +(1 row) + +SELECT * FROM test_type_conversion_int4_int2(100::int4); +INFO: (100, ) +ERROR: value "100" is out of range for type smallint +CONTEXT: while creating return value +PL/Python function "test_type_conversion_int4_int2" +SELECT * FROM test_type_conversion_int4_int2(-100::int4); +INFO: (-100, ) +ERROR: value "-100" is out of range for type smallint +CONTEXT: while creating return value +PL/Python function "test_type_conversion_int4_int2" +SELECT * FROM test_type_conversion_int4_int2(null); +INFO: (None, ) + test_type_conversion_int4_int2 + + +(1 row) + +CREATE FUNCTION test_type_conversion_int8_int2(x int8) RETURNS int2 AS $$ +plpy.info(x, type(x)) +return x +$$ LANGUAGE plpythonu; +SELECT * FROM test_type_conversion_int8_int2(100::int8); +INFO: (100L, ) + test_type_conversion_int8_int2 + +100 +(1 row) + +SELECT * FROM test_type_conversion_int8_int2(-100::int8); +INFO: (-100L, ) + test_type_conversion_int8_int2 + + -100 +(1 row) + +SELECT * FROM test_type_conversion_int8_int2(100::int8); +INFO: (100L, ) +ERROR: value "100" is out of range for type smallint +CONTEXT: while creating return value +PL/Python function "test_type_conversion_int8_int2" +SELECT * FROM test_type_conversion_int8_int2(-100::int8); +INFO: (-100L, ) +ERROR: value "-100" is out of range for type smallint +CONTEXT: while creating return value +PL/Python function "test_type_conversion_int8_int2" +SELECT * FROM test_type_conversion_int8_int2(null); +INFO: (None, ) + test_type_conversion_int8_int2 + + +(1 row) + +CREATE FUNCTION test_type_c
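As a side note, here is a small standalone sketch (embedding Python 3 directly; this is not the plpython code, and the file and function names are made up) of the kind of type-specific conversion with error checking suggested in the review quoted above: use PyLong_AsDouble() once the object is known to be an int and check PyErr_Occurred() afterwards, instead of funnelling everything through PyFloat_AsDouble() with no check.

/*
 * Standalone illustration of type-specific number-to-double conversion with
 * error checking.  Build with something like:
 *   cc py_num.c $(python3-config --cflags --embed --ldflags)
 */
#include <Python.h>
#include <stdio.h>

/* Returns 0 on success, -1 if the object could not be converted. */
static int
number_to_double(PyObject *obj, double *out)
{
	if (PyFloat_Check(obj))
	{
		*out = PyFloat_AS_DOUBLE(obj);	/* type already checked, the fast macro is safe */
		return 0;
	}
	if (PyLong_Check(obj))
	{
		*out = PyLong_AsDouble(obj);	/* raises OverflowError for huge ints */
		if (*out == -1.0 && PyErr_Occurred())
			return -1;
		return 0;
	}
	return -1;					/* a real implementation would fall back to a generic path here */
}

int
main(void)
{
	double		d;
	PyObject   *small,
			   *base,
			   *exp,
			   *huge;

	Py_Initialize();

	small = PyLong_FromLong(42);
	base = PyLong_FromLong(10);
	exp = PyLong_FromLong(400);
	huge = PyNumber_Power(base, exp, Py_None);	/* 10**400: too big for a double */

	printf("42      -> %s\n", number_to_double(small, &d) == 0 ? "ok" : "error");
	printf("10**400 -> %s\n", number_to_double(huge, &d) == 0 ? "ok" : "error");
	PyErr_Clear();

	Py_DECREF(small);
	Py_DECREF(base);
	Py_DECREF(exp);
	Py_DECREF(huge);
	Py_Finalize();
	return 0;
}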
Re: cost_sort() improvements
One more thought about estimating the worst case - I wonder if simply multiplying the per-tuple cost by 1.5 is the right approach. It does not seem particularly principled, and it's trivially simple to construct counter-examples defeating it (imagine columns with 99% of the rows having the same value, and then many values in exactly one row - that will defeat the ndistinct-based group size estimates). Agreed. But right now that is the best we have, and the constant 1.5 could easily be changed to 1 or 10 - it is just an arbitrary value, chosen intuitively. As I mentioned in the v7 patch description, the new estimation function gives a ~10% bigger estimate, and this constant should not be very large, because we could easily end up overestimating. The reason why constructing such counter-examples is so simple is that this relies entirely on the ndistinct stats, ignoring the other stats we already have. I think this might leverage the per-column MCV lists (and eventually multi-column MCV lists, if that ever gets committed). We're estimating the number of tuples in a group (after fixing values in the preceding columns), because that's what determines the number of comparisons we need to make at a given stage. The patch does this by estimating the average group size, by estimating ndistinct of the preceding columns G(i-1), and computing ntuples/G(i-1). But consider that we also have MCV lists for each column - if there is a very common value, it should be there. So let's say M(i) is the frequency of the most common value in the i-th column, determined either from the MCV list or as 1/ndistinct for that column. Then if m(i) is the minimum of M(i) over the first i columns, the largest possible group of values when comparing values in the i-th column is N * m(i-1). This seems like a better way to estimate the worst case. I don't know if this should be used instead of NG(i), or if those two estimates should be combined in some way. Agreed, we definitely need an estimate of the largest group size to use in cost_sort, but I don't feel up to doing that; please, could you attack the group-size computation issue? Thank you. I think estimating the largest group we need to sort should be helpful for the incremental sort patch, so I'm adding Alexander to this thread. Agreed, I think so. The suggested estimation algorithm should be modified to support leading pre-ordered keys, and in that form it could be used directly in the GROUP BY and incremental sort patches. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
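To illustrate the proposed worst-case bound with made-up numbers (plain arithmetic, not patch code): with N rows and per-column most-common-value frequencies M(i), the largest group that can reach the comparison on column i is N * m(i-1), where m is the running minimum of the M values seen so far.

/* Toy illustration of the N * m(i-1) worst-case group bound; all numbers
 * here are invented for the example. */
#include <stdio.h>

int
main(void)
{
	double		N = 1e7;					/* rows to sort */
	double		M[3] = {0.99, 0.10, 0.01};	/* MCV frequency (or 1/ndistinct) per column */
	double		m = 1.0;					/* m(0): the first column sees the whole input */
	int			i;

	for (i = 0; i < 3; i++)
	{
		printf("column %d: largest group to compare = N * m(%d) = %.0f rows\n",
			   i + 1, i, N * m);
		if (M[i] < m)
			m = M[i];
	}
	return 0;
}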
Re: cost_sort() improvements
V8 contains fixes of Tomas Vondra complaints: - correct usage of comparison_cost - remove uneeded check of sortop -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/ diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c index a2a7e0c520..521a7d3c9a 100644 --- a/src/backend/optimizer/path/costsize.c +++ b/src/backend/optimizer/path/costsize.c @@ -1612,6 +1612,250 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm) rterm->pathtarget->width); } +/* + * is_fake_var + * Workaround for generate_append_tlist() which generates fake Vars with + * varno == 0, that will cause a fail of estimate_num_group() call + */ +static bool +is_fake_var(Expr *expr) +{ + if (IsA(expr, RelabelType)) + expr = (Expr *) ((RelabelType *) expr)->arg; + + return (IsA(expr, Var) && ((Var *) expr)->varno == 0); +} + +/* + * get_width_cost_multiplier + * Returns relative complexity of comparing two valyes based on it's width. + * The idea behind - long values have more expensive comparison. Return value is + * in cpu_operator_cost unit. + */ +static double +get_width_cost_multiplier(PlannerInfo *root, Expr *expr) +{ + double width = -1.0; /* fake value */ + + if (IsA(expr, RelabelType)) + expr = (Expr *) ((RelabelType *) expr)->arg; + + /* Try to find actual stat in corresonding relation */ + if (IsA(expr, Var)) + { + Var *var = (Var *) expr; + + if (var->varno > 0 && var->varno < root->simple_rel_array_size) + { + RelOptInfo *rel = root->simple_rel_array[var->varno]; + + if (rel != NULL && +var->varattno >= rel->min_attr && +var->varattno <= rel->max_attr) + { +int ndx = var->varattno - rel->min_attr; + +if (rel->attr_widths[ndx] > 0) + width = rel->attr_widths[ndx]; + } + } + } + + /* Didn't find any actual stats, use estimation by type */ + if (width < 0.0) + { + Node *node = (Node*) expr; + + width = get_typavgwidth(exprType(node), exprTypmod(node)); + } + + /* + * Any value in pgsql is passed by Datum type, so any operation with value + * could not be cheaper than operation with Datum type + */ + if (width <= sizeof(Datum)) + return 1.0; + + /* + * Seems, cost of comparision is not directly proportional to args width, + * because comparing args could be differ width (we known only average over + * column) and difference often could be defined only by looking on first + * bytes. So, use log16(width) as estimation. + */ + return 1.0 + 0.125 * LOG2(width / sizeof(Datum)); +} + +/* + * compute_cpu_sort_cost + * compute CPU cost of sort (i.e. in-memory) + * + * NOTE: some callers currently pass NIL for pathkeys because they + * can't conveniently supply the sort keys. In this case, it will fallback to + * simple comparison cost estimate. + * + * Estimation algorithm is based on ideas from course Algorithms, + * Robert Sedgewick, Kevin Wayne, https://algs4.cs.princeton.edu/home/ and paper + * "Quicksort Is Optimal For Many Equal Keys", Sebastian Wild, + * arXiv:1608.04906v4 [cs.DS] 1 Nov 2017. + * + * In term of that papers, let N - number of tuples, Xi - number of tuples with + * key Ki, then estimation is: + * log(N! / (X1! * X2! * ..)) ~ sum(Xi * log(N/Xi)) + * In our case all Xi are the same because noew we don't have an estimation of + * group sizes, we have only estimation of number of groups. In this case, + * formula becomes: N * log(NumberOfGroups). 
Next, to support correct estimation + * of multicolumn sort we need separately compute each column, so, let k is a + * column number, Gk - number of groups defined by k columns: + * N * sum( Fk * log(Gk) ) + * Fk is sum of functions costs for k columns. + */ + +static Cost +compute_cpu_sort_cost(PlannerInfo *root, List *pathkeys, + Cost comparison_cost, + double tuples, double output_tuples, bool heapSort) +{ + Cost per_tuple_cost = 0.0; + ListCell *lc; + List *pathkeyExprs = NIL; + double tuplesPerPrevGroup = tuples; + bool has_fake_var = false; + double totalFuncCost = 0.0; + + /* fallback if pathkeys is unknown */ + if (list_length(pathkeys) == 0) + { + /* + * If we'll use a bounded heap-sort keeping just K tuples in memory, for + * a total number of tuple comparisons of N log2 K; but the constant + * factor is a bit higher than for quicksort. Tweak it so that the + * cost curve is continuous at the crossover point. + */ + output_tuples = (heapSort) ? 2.0 * output_tuples : tuples; + per_tuple_cost += 2.0 * cpu_operator_cost * LOG2(output_tuples); + + return per_tuple_cost * tuples; + } + + totalFuncCost = 1.0; + + /* + * Computing total cost of sorting takes into account: + * - per column comparison function cost + * - we try to compute needed number of comparison per column + */ + + foreach(lc, pathkeys) + { + PathKey*pathkey = (PathKey*)lfirst(lc); + Cost funcCost = 1.0; /* fallback value */ + EquivalenceMember
Re: cost_sort() improvements
OK, so Fi is pretty much whatever CREATE FUNCTION ... COST says, right? exactly Hmm, makes sense. But doesn't that mean it's mostly a fixed per-tuple cost, not directly related to the comparison? For example, why should it be multiplied by C0? That is, if I create a very expensive comparator (say, with cost 100), why should it increase the cost for transferring the tuple to CPU cache, unpacking it, etc.? I'd say those costs are rather independent of the function cost, and remain rather fixed, no matter what the function cost is. Perhaps you haven't noticed that, because the default funcCost is 1? May be, but see my email https://www.postgresql.org/message-id/ee14392b-d753-10ce-f5ed-7b2f7e277512%40sigaev.ru about additional term proportional to N The number of new magic constants introduced by this patch is somewhat annoying. 2.0, 1.5, 0.125, ... :-( 2.0 is removed in last patch, 1.5 leaved and could be removed when I understand you letter with group size estimation :) 0.125 should be checked, and I suppose we couldn't remove it at all because it "average over whole word" constant. - Final cost is cpu_operator_cost * N * sum(per column costs described above). Note, for single column with width <= sizeof(datum) and F1 = 1 this formula gives exactly the same result as current one. - for Top-N sort empiric is close to old one: use 2.0 multiplier as constant under log2, and use log2(Min(NGi, output_tuples)) for second and following columns. I think compute_cpu_sort_cost is somewhat confused whether per_tuple_cost is directly a cost, or a coefficient that will be multiplied with cpu_operator_cost to get the actual cost. At the beginning it does this: per_tuple_cost = comparison_cost; so it inherits the value passed to cost_sort(), which is supposed to be cost. But then it does the work, which includes things like this: per_tuple_cost += 2.0 * funcCost * LOG2(tuples); where funcCost is pretty much pg_proc.procost. AFAIK that's meant to be a value in units of cpu_operator_cost. And at the end it does this per_tuple_cost *= cpu_operator_cost; I.e. it gets multiplied with another cost. That doesn't seem right. Huh, you are right, will fix in v8. Also, why do we need this? if (sortop != InvalidOid) { Oid funcOid = get_opcode(sortop); funcCost = get_func_cost(funcOid); } Safety first :). Will remove. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: _isnan() on Windows
Andrew Dunstan writes: > On 07/12/2018 10:20 AM, Tom Lane wrote: >> bowerbird and hamerkop have some gripes like this: >> >> bowerbird | c:\perl64\lib\core\win32.h(218): warning C4005: 'isnan' : >> macro redefinition (src/pl/plperl/SPI.c) >> [G:\prog\bf\root\HEAD\pgsql.build\plperl.vcxproj] > We actually undef a bunch of things in plperl.h to keep the compiler > quiet. Maybe we need to add this to the list? Perhaps. But how do we tell the platforms where we should do that from the ones where we shouldn't? regards, tom lane
Re: _isnan() on Windows
On 07/12/2018 10:20 AM, Tom Lane wrote:
> Michael Paquier writes:
>> On Wed, Jul 11, 2018 at 09:13:40AM -0400, Alvaro Herrera wrote:
>>> I just pushed it before seeing your message.
>> Fine as well, thanks for picking this up.  The buildfarm shows no
>> failures about this patch.
>
> I scraped all the compiler warnings from the buildfarm this morning,
> and I see no new ones that could be blamed on this change, so I think
> we're good.
>
> bowerbird and hamerkop have some gripes like this:
>
> bowerbird | c:\perl64\lib\core\win32.h(218): warning C4005: 'isnan' : macro redefinition (src/pl/plperl/SPI.c) [G:\prog\bf\root\HEAD\pgsql.build\plperl.vcxproj]
> bowerbird | c:\perl64\lib\core\win32.h(218): warning C4005: 'isnan' : macro redefinition (src/pl/plperl/Util.c) [G:\prog\bf\root\HEAD\pgsql.build\plperl.vcxproj]
> bowerbird | c:\perl64\lib\core\win32.h(218): warning C4005: 'isnan' : macro redefinition (src/pl/plperl/plperl.c) [G:\prog\bf\root\HEAD\pgsql.build\plperl.vcxproj]
> bowerbird | c:\perl64\lib\core\win32.h(218): warning C4005: 'isnan' : macro redefinition [G:\prog\bf\root\HEAD\pgsql.build\hstore_plperl.vcxproj]
> bowerbird | c:\perl64\lib\core\win32.h(218): warning C4005: 'isnan' : macro redefinition [G:\prog\bf\root\HEAD\pgsql.build\jsonb_plperl.vcxproj]
> bowerbird | c:\perl64\lib\core\win32.h(218): warning C4005: 'isnan' : macro redefinition (src/pl/plperl/SPI.c) [G:\prog\bf\root\HEAD\pgsql.build\plperl.vcxproj]
> bowerbird | c:\perl64\lib\core\win32.h(218): warning C4005: 'isnan' : macro redefinition (src/pl/plperl/Util.c) [G:\prog\bf\root\HEAD\pgsql.build\plperl.vcxproj]
> bowerbird | c:\perl64\lib\core\win32.h(218): warning C4005: 'isnan' : macro redefinition (src/pl/plperl/plperl.c) [G:\prog\bf\root\HEAD\pgsql.build\plperl.vcxproj]
> bowerbird | c:\perl64\lib\core\win32.h(218): warning C4005: 'isnan' : macro redefinition [G:\prog\bf\root\HEAD\pgsql.build\hstore_plperl.vcxproj]
> bowerbird | c:\perl64\lib\core\win32.h(218): warning C4005: 'isnan' : macro redefinition [G:\prog\bf\root\HEAD\pgsql.build\jsonb_plperl.vcxproj]
>
> but those were there before too.  Not sure if there's anything we
> can/should try to do about that.

We actually undef a bunch of things in plperl.h to keep the compiler quiet.
Maybe we need to add this to the list?

cheers

andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
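For reference, the plperl.h workaround being discussed has roughly the
following shape. The Windows-only guard shown here is only a sketch of one
possible answer to Tom's "which platforms" question, not a committed
approach:

    /* Hypothetical addition to the existing undef list in plperl.h. */
    #ifdef _MSC_VER              /* only where the Perl headers redefine it */
    #ifdef isnan
    #undef isnan
    #endif
    #endif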
Re: cost_sort() improvements
> At least [1] and [2] have run into those issues and have objections/questions
> about the correctness of the cost_sort estimation. The suggested patch tries
> to improve the current estimation and solve those issues.

Sorry for the long delay, but the issue was even more complicated than I
thought. When I tried to add cost_sort to the GROUP BY patch I found it
doesn't work as well as I hoped :(. The root of the problem is that the
suggested cost_sort improvement doesn't take into account the uniqueness of
the first column, so its cost is always the same. I experimented with an
all-equal column and an all-unique column and found that execution time can
differ significantly (up to 3 times on 1e7 randomly generated integer
values). So I went back to the books and papers.

So, I suggest a new algorithm based on ideas in [3], [4]. In terms of those
papers, using the designations from my previous email and letting Xi be the
number of tuples with key Ki, the estimate is:
    log(N! / (X1! * X2! * ..)) ~ sum(Xi * log(N/Xi))
In our case all Xi are the same, because for now we don't have an estimate of
group sizes, only an estimate of the number of groups. In this case the
formula becomes: N * log(NumberOfGroups). Next, to support correct estimation
of multicolumn sorts we need to compute each column separately; letting k be
a column number and Gk the number of groups defined by the first k columns:
    N * sum( FCk * log(Gk) )
where FCk is the sum(Cj) over the first k columns. Gk+1 is defined as
estimate_num_groups(NGk), i.e. it's recursive, which means that the
comparison of the k-th column accounts for all comparisons of the j-th
columns, j < k.

Next, I found that this formula gives a significant underestimate for
N < 1e4, and using [5] (see Chapter 8.2 and Theorem 4.1) I found that we can
use the formula N + N * log(N), which effectively adds a multiplier in the
simple case; but it's unclear to me how to add that multiplier to the
multicolumn formula, so I just added a separate term proportional to N.

As a demonstration, I've attached the results of some tests; the *.sh and
*.plt files are scripts to reproduce them. Note, all charts are normalized
because the computed cost is not in milliseconds. p.pl is a parser for the
JSON format of EXPLAIN ANALYZE.
N test  - sorts unique values with varying numbers of tuples
NN test - same as the previous one, but sorts a single value (all values the
          same)
U test  - fixed total number of values (1e7) but varying numbers of unique
          values
I hope the new estimation is much closer to real sort timing. The new
estimation function gives results close to the old one on trivial examples,
but is ~10% more expensive, and three regression tests don't pass; I will
look at that more closely later. The patch doesn't include regression test
changes.

[1] https://commitfest.postgresql.org/18/1124/
[2] https://commitfest.postgresql.org/18/1651/
[3] https://algs4.cs.princeton.edu/home/, course Algorithms, Robert
    Sedgewick, Kevin Wayne
[4] "Quicksort Is Optimal For Many Equal Keys", Sebastian Wild,
    arXiv:1608.04906v4 [cs.DS] 1 Nov 2017
[5] "Introduction to algorithms", Thomas H. Cormen, Charles E. Leiserson,
    Ronald L.
    Rivest, ISBN 0-07-013143-0

-- 
Teodor Sigaev                     E-mail: teo...@sigaev.ru
                                  WWW: http://www.sigaev.ru/

diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index a2a7e0c520..3588cff802 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -1612,6 +1612,252 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
 						  rterm->pathtarget->width);
 }
 
+/*
+ * is_fake_var
+ *		Workaround for generate_append_tlist() which generates fake Vars with
+ *		varno == 0; they would cause a failure of the estimate_num_groups() call.
+ */
+static bool
+is_fake_var(Expr *expr)
+{
+	if (IsA(expr, RelabelType))
+		expr = (Expr *) ((RelabelType *) expr)->arg;
+
+	return (IsA(expr, Var) && ((Var *) expr)->varno == 0);
+}
+
+/*
+ * get_width_cost_multiplier
+ *		Returns the relative complexity of comparing two values based on their
+ *		width.  The idea behind this: longer values have more expensive
+ *		comparisons.  The return value is in cpu_operator_cost units.
+ */
+static double
+get_width_cost_multiplier(PlannerInfo *root, Expr *expr)
+{
+	double		width = -1.0;	/* fake value */
+
+	if (IsA(expr, RelabelType))
+		expr = (Expr *) ((RelabelType *) expr)->arg;
+
+	/* Try to find actual stat in corresponding relation */
+	if (IsA(expr, Var))
+	{
+		Var		   *var = (Var *) expr;
+
+		if (var->varno > 0 && var->varno < root->simple_rel_array_size)
+		{
+			RelOptInfo *rel = root->simple_rel_array[var->varno];
+
+			if (rel != NULL &&
+				var->varattno >= rel->min_attr &&
+				var->varattno <= rel->max_attr)
+			{
+				int			ndx = var->varattno - rel->min_attr;
+
+				if (rel->attr_widths[ndx] > 0)
+					width = rel->attr_widths[ndx];
+			}
+		}
+	}
+
+	/* Didn't find any actual stats, use estimation by type */
+	if (width < 0.0)
+	{
+		Node	   *node = (Node *) expr;
+
+		width = get_typavgwidth(exprType(node), exprTypmod(node)
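To make the proposed per-column estimate concrete, here is a small standalone
sketch of the comparison-count formula described above, N * sum(FCk * log(Gk))
plus a separate term proportional to N. The function name and parameter
layout are made up for illustration; in the patch the group counts would come
from estimate_num_groups() and the coefficient of the linear term is still
under discussion:

    #include <math.h>

    /*
     * Illustrative sketch only: estimated comparison count for sorting N
     * tuples on ncols columns.  groups_k[k] is the estimated number of
     * distinct prefixes of the first k+1 columns; proc_cost[k] is the
     * accumulated comparison-function cost (procost) of those columns.
     */
    static double
    sketch_sort_comparisons(double N, int ncols,
                            const double *groups_k, const double *proc_cost)
    {
        double comparisons = 0.0;

        for (int k = 0; k < ncols; k++)
            comparisons += proc_cost[k] * log2(groups_k[k]); /* FCk * log(Gk) */

        /* separate term proportional to N, per the small-N observation above */
        return N * comparisons + N;
    }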
Re: _isnan() on Windows
Michael Paquier writes:
> On Wed, Jul 11, 2018 at 09:13:40AM -0400, Alvaro Herrera wrote:
>> I just pushed it before seeing your message.
> Fine as well, thanks for picking this up.  The buildfarm shows no
> failures about this patch.

I scraped all the compiler warnings from the buildfarm this morning,
and I see no new ones that could be blamed on this change, so I think
we're good.

bowerbird and hamerkop have some gripes like this:

bowerbird | c:\perl64\lib\core\win32.h(218): warning C4005: 'isnan' : macro redefinition (src/pl/plperl/SPI.c) [G:\prog\bf\root\HEAD\pgsql.build\plperl.vcxproj]
bowerbird | c:\perl64\lib\core\win32.h(218): warning C4005: 'isnan' : macro redefinition (src/pl/plperl/Util.c) [G:\prog\bf\root\HEAD\pgsql.build\plperl.vcxproj]
bowerbird | c:\perl64\lib\core\win32.h(218): warning C4005: 'isnan' : macro redefinition (src/pl/plperl/plperl.c) [G:\prog\bf\root\HEAD\pgsql.build\plperl.vcxproj]
bowerbird | c:\perl64\lib\core\win32.h(218): warning C4005: 'isnan' : macro redefinition [G:\prog\bf\root\HEAD\pgsql.build\hstore_plperl.vcxproj]
bowerbird | c:\perl64\lib\core\win32.h(218): warning C4005: 'isnan' : macro redefinition [G:\prog\bf\root\HEAD\pgsql.build\jsonb_plperl.vcxproj]
bowerbird | c:\perl64\lib\core\win32.h(218): warning C4005: 'isnan' : macro redefinition (src/pl/plperl/SPI.c) [G:\prog\bf\root\HEAD\pgsql.build\plperl.vcxproj]
bowerbird | c:\perl64\lib\core\win32.h(218): warning C4005: 'isnan' : macro redefinition (src/pl/plperl/Util.c) [G:\prog\bf\root\HEAD\pgsql.build\plperl.vcxproj]
bowerbird | c:\perl64\lib\core\win32.h(218): warning C4005: 'isnan' : macro redefinition (src/pl/plperl/plperl.c) [G:\prog\bf\root\HEAD\pgsql.build\plperl.vcxproj]
bowerbird | c:\perl64\lib\core\win32.h(218): warning C4005: 'isnan' : macro redefinition [G:\prog\bf\root\HEAD\pgsql.build\hstore_plperl.vcxproj]
bowerbird | c:\perl64\lib\core\win32.h(218): warning C4005: 'isnan' : macro redefinition [G:\prog\bf\root\HEAD\pgsql.build\jsonb_plperl.vcxproj]

but those were there before too.  Not sure if there's anything we
can/should try to do about that.

			regards, tom lane
Re: [HACKERS] WAL logging problem in 9.4.3?
On 12/07/18 16:51, Andrew Dunstan wrote:
> On 07/10/2018 11:32 PM, Michael Paquier wrote:
>> On Tue, Jul 10, 2018 at 05:35:58PM +0300, Heikki Linnakangas wrote:
>>> Thanks for picking this up! (I hope this gets through the email filters
>>> this time, sending a shell script seems to be difficult. I also trimmed
>>> the CC list, if that helps.)
>>>
>>> On 04/07/18 07:59, Michael Paquier wrote:
>>>> Hence I propose the patch attached which disables the TRUNCATE and COPY
>>>> optimizations for two cases, which are the ones actually causing
>>>> problems. One solution has been presented by Simon here for COPY, which
>>>> is to disable the optimization when there are no blocks on a relation
>>>> with wal_level = minimal:
>>>> https://www.postgresql.org/message-id/CANP8+jKN4V4MJEzFN_iEtdZ+1oM=yetxvmuu1yk4umxqy2g...@mail.gmail.com
>>>> For back-patching, I find that really appealing.
>>>
>>> This fails in the case that there are any WAL-logged changes to the table
>>> while the COPY is running. That can happen at least if the table has an
>>> INSERT trigger, that performs operations on the same table, and the COPY
>>> fires the trigger. That scenario is covered by the little bash script I
>>> posted earlier in this thread
>>> (https://www.postgresql.org/message-id/55AFC302.1060805%40iki.fi).
>>> Attached is a new version of that script, updated to make it work with v11.
>>
>> Thanks for the pointer. My tap test has been covering two out of the three
>> scenarios you have in your script. I have been able to convert the extra
>> as the attached, and I have added as well an extra test with TRUNCATE
>> triggers. So it seems to me that we want to disable the optimization if
>> any type of trigger are defined on the relation copied to as it could be
>> possible that these triggers work on the blocks copied as well, for any
>> BEFORE/AFTER and STATEMENT/ROW triggers. What do you think?
>
> Yeah, this seems like the only sane approach.

Doesn't have to be a trigger, could be a CHECK constraint, datatype input
function, etc. Admittedly, having a datatype input function that inserts to
the table is worth a "huh?", but I'm feeling very confident that we can catch
all such cases, and some of them might even be sensible.

- Heikki
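For illustration, the kind of check being discussed for COPY would look
roughly like the following. The variable names mirror copy.c of that era, but
treat this as a sketch of the idea rather than the proposed patch; it elides
the other conditions copy.c actually tests, and, as noted above, it would not
cover CHECK constraints or datatype input functions:

    /*
     * Sketch: only skip WAL when the relation was created in this
     * subtransaction and it has no triggers that could modify the table
     * while the COPY itself is running.
     */
    if (cstate->rel->rd_createSubid != InvalidSubTransactionId &&
        cstate->rel->trigdesc == NULL)
        hi_options |= HEAP_INSERT_SKIP_WAL;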
Re: [HACKERS] WAL logging problem in 9.4.3?
On 07/10/2018 11:32 PM, Michael Paquier wrote:
> On Tue, Jul 10, 2018 at 05:35:58PM +0300, Heikki Linnakangas wrote:
>> Thanks for picking this up! (I hope this gets through the email filters
>> this time, sending a shell script seems to be difficult. I also trimmed
>> the CC list, if that helps.)
>>
>> On 04/07/18 07:59, Michael Paquier wrote:
>>> Hence I propose the patch attached which disables the TRUNCATE and COPY
>>> optimizations for two cases, which are the ones actually causing
>>> problems. One solution has been presented by Simon here for COPY, which
>>> is to disable the optimization when there are no blocks on a relation
>>> with wal_level = minimal:
>>> https://www.postgresql.org/message-id/CANP8+jKN4V4MJEzFN_iEtdZ+1oM=yetxvmuu1yk4umxqy2g...@mail.gmail.com
>>> For back-patching, I find that really appealing.
>>
>> This fails in the case that there are any WAL-logged changes to the table
>> while the COPY is running. That can happen at least if the table has an
>> INSERT trigger, that performs operations on the same table, and the COPY
>> fires the trigger. That scenario is covered by the little bash script I
>> posted earlier in this thread
>> (https://www.postgresql.org/message-id/55AFC302.1060805%40iki.fi).
>> Attached is a new version of that script, updated to make it work with v11.
>
> Thanks for the pointer. My tap test has been covering two out of the three
> scenarios you have in your script. I have been able to convert the extra
> as the attached, and I have added as well an extra test with TRUNCATE
> triggers. So it seems to me that we want to disable the optimization if
> any type of trigger are defined on the relation copied to as it could be
> possible that these triggers work on the blocks copied as well, for any
> BEFORE/AFTER and STATEMENT/ROW triggers. What do you think?

Yeah, this seems like the only sane approach.

cheers

andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Vacuum: allow usage of more than 1GB of work mem
On 04/06/2018 08:00 PM, Claudio Freire wrote:
> On Fri, Apr 6, 2018 at 5:25 PM, Claudio Freire wrote:
>> On Fri, Apr 6, 2018 at 10:39 AM, Heikki Linnakangas wrote:
>>> On 06/04/18 01:59, Claudio Freire wrote:
>>>> The iteration interface, however, seems quite specific for the use case
>>>> of vacuumlazy, so it's not really a good abstraction.
>>>
>>> Can you elaborate? It does return the items one block at a time. Is that
>>> what you mean by being specific for vacuumlazy? I guess that's a bit
>>> special, but if you imagine some other users for this abstraction, it's
>>> probably not that unusual. For example, if we started using it in bitmap
>>> heap scans, a bitmap heap scan would also want to get the TIDs one block
>>> number at a time.
>>
>> But you're also tying the caller to the format of the buffer holding those
>> TIDs, for instance. Why would you, when you can have an interface that
>> just iterates TIDs and let the caller store them if/however they want?
>>
>> I do believe a pure iterator interface is a better interface.
>
> Between the b-tree or not discussion and the refactoring to separate the
> code, I don't think we'll get this in the next 24hs. So I guess we'll have
> ample time to ponder both issues during the next commit fest.

There doesn't seem to have been much pondering done since then, at least
publicly. Can we make some progress on this? It's been around for a long
time now.

cheers

andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
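For concreteness, the "pure iterator" shape being argued for would look
something like the following. The type and function names here are
hypothetical, not anything in the posted patch:

    /* Hypothetical sketch of a storage-agnostic dead-TID iterator. */
    typedef struct DeadTidIter DeadTidIter;

    extern DeadTidIter *dead_tid_begin_iterate(void *store);
    extern bool dead_tid_next(DeadTidIter *iter, ItemPointerData *tid); /* false when exhausted */
    extern void dead_tid_end_iterate(DeadTidIter *iter);

The caller then decides whether to batch the TIDs per block, build a bitmap,
or stream them, which is the decoupling being asked for above.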
Re: buildfarm vs code
On 07/12/2018 05:54 AM, Heikki Linnakangas wrote:
> On 02/07/18 10:57, Peter Eisentraut wrote:
>> On 05.06.18 18:09, Andrew Dunstan wrote:
>>> The first should be simple and non-controversial. It allows
>>> src/tools/msvc/build.pl to be called in such a way that it only creates
>>> the project files and then stops. This is a one line addition to the
>>> script and should affect nobody not using the option. A patch for it is
>>> attached.
>>
>> This seems reasonable.
>
> There's already a separate script for this, src/tools/msvc/mkvcbuild.pl.
> Doesn't that do the same thing?

Yes, it does, more or less. For some reason I'd completely forgotten about
it. I'll withdraw the patch.

cheers

andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Binary difference in pg_internal.init after running pg_initdb multiple times
"samuel.coulee" <313914...@qq.com> writes: > In the PG source code function "write_relcache_init_file()", I found that > the whole 'Relation' structs were directly written into the file > 'pg_internal.init'. This brings some binary differences of that file, if we > run pg_initdb multiple times, because the struct 'Relation' contains some > pointer fields. There's never been any expectation that that file is bitwise-reproducible. The order of entries is arbitrary, there are probably alignment padding bytes that contain garbage, etc etc. So I'm not buying into your apparent goal here. > And my question is : Could we clear the pointer values in 'Relation' before > calling write_item(...)? No; that's live data with no backup copy. Conceivably we could copy the structure and zero out useless fields before writing from the copy, but that adds code, cycles, maintenance effort, and risk of bugs. regards, tom lane
Re: pg_create_logical_replication_slot returns text instead of name
On Thu, Jul 12, 2018 at 07:00:48PM +0900, Masahiko Sawada wrote: > The documentation[1] says that both pg_create_logical_replication_slot > and pg_create_physical_replication_slot returns slot_name as a name > type. But only pg_create_logical_replication_slot returns it as text > type. I think these should be united as the name type. Attached small > patch fixes it. > > [1] > https://www.postgresql.org/docs/devel/static/functions-admin.html#FUNCTIONS-REPLICATION That's a small thing, but I agree with you. As far as I can see slot names are always mapped with the name type. I'll push that tomorrow if there are no objections. -- Michael signature.asc Description: PGP signature
Re: Negotiating the SCRAM channel binding type
On Wed, Jul 11, 2018 at 04:00:47PM +0300, Heikki Linnakangas wrote: > Looking at the GnuTLS docs, I believe it has everything we need. > gnutls_certificate_get_peers() and gnutls_certificate_get_ours() can be used > to get the certificate, and gnutls_x509_crt_get_signature_algorithm() gets > the signatureAlgorithm. Looking at the docs, there is gnutls_x509_crt_get_fingerprint() which can provide the certificate hash. So if the signature algorithm is MD5 or SHA-1, it would be simple enough to upgrade it to SHA-256 and calculate the hash. They have way better docs than OpenSSL, which is nice. -- Michael signature.asc Description: PGP signature
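To sketch how that could look with GnuTLS (treat the function name and the
fallback behavior as assumptions, not a worked-out proposal; this assumes the
gnutls/x509.h header):

    /* Sketch: pick the digest for tls-server-end-point per the RFC 5929 rules. */
    static gnutls_digest_algorithm_t
    choose_cert_hash(gnutls_x509_crt_t cert)
    {
        int sigalg = gnutls_x509_crt_get_signature_algorithm(cert);
        gnutls_digest_algorithm_t dig = gnutls_sign_get_hash_algorithm(sigalg);

        /* MD5 and SHA-1 (or anything unrecognized) get upgraded to SHA-256 */
        if (dig == GNUTLS_DIG_MD5 || dig == GNUTLS_DIG_SHA1 || dig == GNUTLS_DIG_UNKNOWN)
            dig = GNUTLS_DIG_SHA256;

        return dig;
    }

gnutls_x509_crt_get_fingerprint() could then be called with that digest to
produce the certificate hash that goes into the channel binding data.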
Re: Negotiating the SCRAM channel binding type
On Thu, Jul 12, 2018 at 12:34:51PM +0300, Heikki Linnakangas wrote:
> Meh. We're not going to implement tls-unique, anyway, in some of the
> upcoming non-OpenSSL TLS implementations that don't support it.

True enough.  Only GnuTLS supports it:
https://www.gnutls.org/manual/html_node/Channel-Bindings.html

> Hmm. That is actually in a section called "Default Channel Binding". And
> the first paragraph says "A default channel binding type agreement process
> for all SASL application protocols that do not provide their own channel
> binding type agreement is provided as follows". Given that, it's not
> entirely clear to me if the requirement to support tls-unique is for all
> implementations of SCRAM, or only those applications that don't provide
> their own channel binding type agreement.

I am not sure, but I read that as meaning tls-unique must be the default when
it is available, so if it is technically possible to support it, we should
prefer it.  If not, then a channel binding type supported by both the server
and the client can be chosen.
--
Michael

signature.asc
Description: PGP signature
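The priority rule described above can be sketched as follows; the function
and its boolean capability flags are made-up names for illustration, not part
of any posted patch:

    /* Sketch of the negotiation priority discussed above. */
    static const char *
    choose_channel_binding(bool server_tls_unique, bool client_tls_unique,
                           bool server_end_point, bool client_end_point)
    {
        if (server_tls_unique && client_tls_unique)
            return "tls-unique";            /* preferred default when both sides have it */
        if (server_end_point && client_end_point)
            return "tls-server-end-point";
        return NULL;                        /* fall back to SCRAM without channel binding */
    }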
Re: partition pruning doesn't work with IS NULL clause in multikey range partition case
I think we should add this to open items list so that it gets tracked. On Thu, Jul 12, 2018 at 6:31 PM, Ashutosh Bapat wrote: > On Thu, Jul 12, 2018 at 11:10 AM, Amit Langote > wrote: >>> >>> I think your fix is correct. I slightly modified it along with updating >>> nearby comments and added regression tests. >> >> I updated regression tests to reduce lines. There is no point in >> repeating tests like v2 patch did. > > + * > + * For hash partitioning however, it is possible to combine null and non- > + * null keys in a pruning step, so do this only if *all* partition keys > + * are involved in IS NULL clauses. > > I don't think this is true. When equality conditions and IS NULL clauses cover > all partition keys of a hash partitioned table and do not have contradictory > clauses, we should be able to find the partition which will remain unpruned. I > see that we already have this supported in get_matching_hash_bounds() > /* > * For hash partitioning we can only perform pruning based on equality > * clauses to the partition key or IS NULL clauses. We also can only > * prune if we got values for all keys. > */ > if (nvalues + bms_num_members(nullkeys) == partnatts) > { > > */ > -if (!generate_opsteps) > +if (!bms_is_empty(nullkeys) && > +(part_scheme->strategy != PARTITION_STRATEGY_HASH || > + bms_num_members(nullkeys) == part_scheme->partnatts)) > > So, it looks like we don't need bms_num_members(nullkeys) == > part_scheme->partnatts there. > > Also, I think, we don't know how some new partition strategy will treat NULL > values so above condition looks wrong to me. Instead it should explicitly > check > the strategies for which we know that the NULL values go to a single > partition. > > /* > - * Note that for IS NOT NULL clauses, simply having step suffices; > - * there is no need to propagate the exact details of which keys are > - * required to be NOT NULL. Hash partitioning expects to see actual > - * values to perform any pruning. > + * There are no OpExpr's, but there are IS NOT NULL clauses, which > + * can be used to eliminate the null-partition-key-only partition. > > I don't understand this. When there are IS NOT NULL clauses for all the > partition keys, it's only then that we could eliminate the partition > containing > NULL values, not otherwise. > > > -- > Best Wishes, > Ashutosh Bapat > EnterpriseDB Corporation > The Postgres Database Company -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: Binary difference in pg_internal.init after running pg_initdb multiple times
On 07/12/2018 10:08 AM, samuel.coulee wrote:
> Hi,
>
> In the PG source code function "write_relcache_init_file()", I found that
> the whole 'Relation' structs were directly written into the file
> 'pg_internal.init'. This brings some binary differences of that file, if
> we run pg_initdb multiple times, because the struct 'Relation' contains
> some pointer fields.
>
> And my question is: Could we clear the pointer values in 'Relation' before
> calling write_item(...)? No benefit regarding function or performance,
> just for binary compatibility...

Binary compatibility with what? Can you describe a scenario where this
actually matters?

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: partition pruning doesn't work with IS NULL clause in multikey range partition case
On Thu, Jul 12, 2018 at 11:10 AM, Amit Langote wrote:
>> I think your fix is correct. I slightly modified it along with updating
>> nearby comments and added regression tests.
>
> I updated regression tests to reduce lines. There is no point in
> repeating tests like v2 patch did.

+ *
+ * For hash partitioning however, it is possible to combine null and non-
+ * null keys in a pruning step, so do this only if *all* partition keys
+ * are involved in IS NULL clauses.

I don't think this is true. When equality conditions and IS NULL clauses
cover all partition keys of a hash partitioned table and do not have
contradictory clauses, we should be able to find the partition which will
remain unpruned. I see that we already have this supported in
get_matching_hash_bounds()

    /*
     * For hash partitioning we can only perform pruning based on equality
     * clauses to the partition key or IS NULL clauses. We also can only
     * prune if we got values for all keys.
     */
    if (nvalues + bms_num_members(nullkeys) == partnatts)
    {

  */
-        if (!generate_opsteps)
+        if (!bms_is_empty(nullkeys) &&
+            (part_scheme->strategy != PARTITION_STRATEGY_HASH ||
+             bms_num_members(nullkeys) == part_scheme->partnatts))

So, it looks like we don't need bms_num_members(nullkeys) ==
part_scheme->partnatts there.

Also, I think, we don't know how some new partition strategy will treat NULL
values, so the above condition looks wrong to me. Instead it should
explicitly check the strategies for which we know that the NULL values go to
a single partition.

 /*
- * Note that for IS NOT NULL clauses, simply having step suffices;
- * there is no need to propagate the exact details of which keys are
- * required to be NOT NULL. Hash partitioning expects to see actual
- * values to perform any pruning.
+ * There are no OpExpr's, but there are IS NOT NULL clauses, which
+ * can be used to eliminate the null-partition-key-only partition.

I don't understand this. When there are IS NOT NULL clauses for all the
partition keys, it's only then that we could eliminate the partition
containing NULL values, not otherwise.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
Re: Temporary WAL segments files not cleaned up after an instance crash
On 12/07/18 15:38, Michael Paquier wrote:
> On Thu, Jul 12, 2018 at 01:15:03PM +0300, Heikki Linnakangas wrote:
>> On 12/07/18 10:44, Michael Paquier wrote:
>>> +	snprintf(path, MAXPGPATH, XLOGDIR "/%s", xlde->d_name);
>>> +	elog(DEBUG2, "removed temporary WAL file \"%s\"", path);
>>> +	unlink(path);
>>
>> The elog message says "removed", but the removal actually happens after
>> the elog. "removing" would be more accurate.
>
> Or just move the elog() after the file is actually removed?  Would you be
> fine with that?

Sure.

- Heikki
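For clarity, the variant Michael is suggesting would look roughly like this
(a sketch; error handling for unlink() is left out here just as in the quoted
hunk):

    snprintf(path, MAXPGPATH, XLOGDIR "/%s", xlde->d_name);
    unlink(path);
    elog(DEBUG2, "removed temporary WAL file \"%s\"", path);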
Re: Temporary WAL segments files not cleaned up after an instance crash
On Thu, Jul 12, 2018 at 01:15:03PM +0300, Heikki Linnakangas wrote: > On 12/07/18 10:44, Michael Paquier wrote: > > + snprintf(path, MAXPGPATH, XLOGDIR "/%s", xlde->d_name); > > + elog(DEBUG2, "removed temporary WAL file \"%s\"", path); > > + unlink(path); > > The elog message says "removed", but the removal actually happens after the > elog. "removing" would be more accurate. Or just move the elog() after the file is actually removed? Would you be fine with that? -- Michael signature.asc Description: PGP signature
Re: make installcheck-world in a clean environment
Hello Tom,

11.07.2018 23:15, Tom Lane wrote:
>> make clean
>> # Also you can just install binary packages to get the same state.
>> make installcheck-world
>> # This check fails.
> I do not think that should be expected to work. It would require that
> "make installcheck" first invoke "make all" (to rebuild the stuff you
> threw away with "make clean"), which is rather antithetical to its
> purpose. Specifically, installcheck is supposed to be checking something
> you already built; so having it do a fresh build seems to introduce
> version-skew hazards that we don't need.

If I understand correctly, the installcheck target should not use the stuff
in the build directory (that is thrown away with 'make clean'), but instead
should use/check the installed assets. In fact, with REL_10_STABLE you can
run "make clean && make installcheck" and it works just fine (without calling
'make all'). It's sufficient to have a working instance running. And I think
this behavior is correct: "installcheck" is supposed to check not something
we built ("check" is supposed to do that), but something we installed.

It is only when I run "make installcheck-world" with REL_10_STABLE that I get
the ECPG-related error I referred to in the first message in this thread. In
the master branch there were some changes that prevent the "make clean &&
make installcheck" scenario from running, but I think that can (and should)
be fixed too. (When I prepared the patches, there were no differences between
these branches in this aspect.)

So for me the question is: which assets should the installcheck target be
checking, installed or built ones? For example, if psql/pg_dump/ecpg/... is
installed in /usr/local/pgsql/bin/, should that be what installcheck checks?
And if we target the installed stuff, then why do we need to build anything
(or have anything built) for installcheck?

Best regards,
-- 
Alexander Lakhin
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: Copy function for logical replication slots
On Mon, Jul 9, 2018 at 10:34 AM, Michael Paquier wrote:
> On Mon, Jul 09, 2018 at 10:06:00AM +0900, Masahiko Sawada wrote:
>> I think that this patch might be splitted but I will be able to send
>> an updated patch in the next week. As you suggestion this patch needs
>> more careful thoughts. I'll move this patch to the next commit fest if
>> I will not be able to sent it. Is that okay?
>
> Fine by me.

Thanks for the update. Attached is a new version of the patch incorporating
all the comments I got from Michael-san.

To prevent the WAL segment file at the restart_lsn of the origin slot from
being removed while the target slot is created, I've chosen to create the new
slot while still holding the origin one. One problem with implementing it
this way is that the current replication slot code doesn't allow us to do
that straightforwardly; the code assumes that a process creating a new slot
is not already holding another slot. So I've changed the copy functions so
that they temporarily save MyReplicationSlot, and then restore and release it
after creating the target slot. To ensure that both the origin and the target
slot are released properly in failure cases, I used
PG_ENSURE_ERROR_CLEANUP(). That way, we can keep the changes to the logical
decoding code to a minimum. I also thought about changing the logical
decoding code so that it allows a process to hold more than one slot at once,
but that seems like overkill to me.

Please review it.

Regards,

-- 
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

v4-0001-Copy-function-for-logical-and-physical-replicatio.patch
Description: Binary data
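The save/restore pattern described above has roughly the following shape.
This is a sketch of the idea rather than the patch itself; copy_slot_cleanup
and dst_name are placeholder names, and initialization of the new slot from
the origin's state is elided:

    /* Sketch of the pattern described above (names are placeholders). */
    ReplicationSlot *origin = MyReplicationSlot;    /* already acquired by the caller */

    PG_ENSURE_ERROR_CLEANUP(copy_slot_cleanup, PointerGetDatum(origin));
    {
        MyReplicationSlot = NULL;                   /* let slot creation proceed */
        ReplicationSlotCreate(dst_name, true, RS_EPHEMERAL);
        /* ... copy restart_lsn and other state from the origin slot ... */
        ReplicationSlotRelease();                   /* release the target slot */
        MyReplicationSlot = origin;                 /* restore the origin slot */
    }
    PG_END_ENSURE_ERROR_CLEANUP(copy_slot_cleanup, PointerGetDatum(origin));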