Re: Feature Request - DDL deployment with logical replication
Hi Jeremy, > My whole point is that in most architectures, DBAs decide to deploy the same > SQL on providers and subscribers. Yes it isn't perfect, but IMO, it is very > helpful to try to automate that idea, as opposed to trying to actually > replicate DDL at the low level. The latter is better, yes, but seems to > have proven extremely difficult. Hence, why you see the advent of functions > to pipe DDL through the replication stream. > The community is currently working, in the current commitfest, on getting logical decoding of 2PC into core. Once something like that gets in, for a large subset of DDLs (those which work inside transaction blocks), one of the use cases of that functionality could be to trap these DDLs and convert them into implicit 2PC commands. Details need to be worked out, but we would get all the logical replication cluster nodes in sync with each other and issue a PREPARE TRANSACTION involving this DDL on all nodes in the logical replication cluster. If any of the nodes is not able to successfully prepare this DDL, then we can roll back, or else commit, the 2PC, thus moving the entire logical cluster consistently in terms of schema changes. Regards, Nikhils -- Nikhil Sontakke http://www.2ndQuadrant.com/ PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
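To make the intended flow concrete, a minimal SQL sketch, assuming an illustrative GID and leaving aside how the DDL is trapped and broadcast to the nodes:

    -- run on every node in the logical replication cluster
    BEGIN;
    ALTER TABLE t ADD COLUMN extra integer;
    PREPARE TRANSACTION 'ddl_sync_0001';

    -- once every node has reported a successful prepare
    COMMIT PREPARED 'ddl_sync_0001';

    -- if any node failed to prepare, undo it everywhere instead
    -- ROLLBACK PREPARED 'ddl_sync_0001';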
Re: 2018-03 Commitfest Summary (Andres #1)
Hello Bruce, Has anyone considered moving pgbench out of our git tree and into a separate project where a separate team could maintain and improve it? The movement has been the exact reverse: it was initially in contrib where it had some independence, and was promoted to the main source tree by Peter Eisentraut in March 2015, effective in the 9.5 release. Not sure why... Handy dev tool for testing? One regression test in pgbench is really testing some postgres feature, and it could be used more for this purpose (eg generating a continuous stream of SQL to test failover, replication, ...). As pointed out by Pavel, there is significant code sharing with psql (scanner, \if stuff), which may grow even more if pgbench client-side expressions are moved there as well (whether this is actually desired is pretty unclear, though). So I do not think it would be desirable or practical to have it outside. However, it helps explain the disagreements about pgbench features: pgbench's internal, developer-oriented use for postgres is somewhat limited, and a lot of new features are suggested with an external performance benchmarking use in mind, in which core committers do not seem much interested. -- Fabien.
Re: csv format for psql
Bonjour Daniel, For csv, Fabien and Peter expressed the opinion that we shouldn't create another fieldsep-like variable specifically for it, but instead reuse fieldsep. That's what my latest patch does. Now it turns out that sharing fieldsep comes with some problems. Personally I do not have any problem with CSV defaulting to the '|' separator, given that people often use anything but a comma for the purpose anyway, including '|'. However Pavel wants to block the patch on this point. Too bad. Maybe you can do (again) a patch with a 'fieldsep_csv' or whatever additional variable? Not sure about 'recordsep'... And then we forward both versions to committers and they can choose whatever they think best, including not committing anything? -- Fabien.
Re: csv format for psql
On 31/03/18 21:33, Fabien COELHO wrote: Bonjour Daniel, For csv, Fabien and Peter expressed the opinion that we shouldn't create another fieldsep-like variable specifically for it, but instead reuse fieldsep. That's what my latest patch does. Now it turns out that sharing fieldsep comes with some problems. Personnaly I do not have any problem with CSV defaulting to '|' separator, given that anyway people often use anything but a comma for the purpose, including '|'. However Pavel wants to block the patch on this point. Too bad. Maybe you can do (again) a patch with a 'fieldsep_csv' or whatever additional variable? Not sure about 'recordsep'... And then we forward to committers both version and they can chose whatever they think best, including not committing anything? Would suggest the choice of 'fieldsep_csv' out of those two (a record would correspond to the whole row), and the default being a comma. Cheers, Gavin
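For what it's worth, a hypothetical psql session under that proposal might look like the sketch below; both the 'fieldsep_csv' variable and its comma default are exactly the points still under debate, so this is illustration only:

    \pset format csv
    \pset fieldsep_csv ','
    SELECT 1 AS a, 'hello, world' AS b;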
Re: Planning counters in pg_stat_statements
+1 Shouldn't this be added to the next CF? nb: As plan_time is not included in total_time, could it be added to usage (for statement eviction calculation)? I will try to include plan_time in my proposed version of pgss with planid. http://www.postgresql-archive.org/Poc-pg-stat-statements-with-planid-td6014027.html Regards PAscal -- Sent from: http://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html
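As a sketch of how such a counter would be consumed, the query below uses only existing pg_stat_statements columns; the planning-time column name is an assumption of the proposal, not something that exists today:

    SELECT query, calls, total_time        -- existing columns
           /* , plan_time  -- proposed planning counter */
      FROM pg_stat_statements
     ORDER BY total_time DESC
     LIMIT 5;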
Re: [HACKERS][PATCH] adding simple sock check for windows
On Sat, Mar 31, 2018 at 12:05 PM, CharSyam wrote: > Amit, I agree with you. > > I changed my patch :) to remove old check. > - if (slot->sock < 0) + if (slot->sock == PGINVALID_SOCKET || slot->sock < 0) I still see the same check. I think by mistake you have attached old patch. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [PROPOSAL] Shared Ispell dictionaries
Hello all, I'd like to add a new optional function to the text search template, named fini(), in addition to init() and lexize(). It will be called by RemoveTSDictionaryById() and AlterTSDictionary(). dispell_fini() will call ts_dict_shmem_release(). It doesn't change the segment-leaking situation. I think it makes the text search API more transparent. I'll update the existing documentation. And I think I can add text search API documentation in the 2018-09 commitfest, as Tom noticed that it doesn't exist. Any thoughts? -- Arthur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
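At the SQL level the proposal would look roughly like the sketch below; INIT, LEXIZE and the dispell_* functions already exist, while the FINI option is the hypothetical addition:

    CREATE TEXT SEARCH TEMPLATE ispell_shared (
        INIT   = dispell_init,
        LEXIZE = dispell_lexize
        -- , FINI = dispell_fini  -- proposed cleanup hook, called from
        --   RemoveTSDictionaryById() and AlterTSDictionary()
    );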
bulk typos
I needed another distraction so bulk-checked for typos, limited to comments in *.[ch]. I'm not passionate about this, but it serves the purpose of reducing the overhead of fixing them individually. Also I heard something here recently about ugly languages.. time find . -name '*.c' -print0 |xargs -r0 sed -s '/.*\/\*/!d; s///; :l; /\*\/.*/!{N;b l}; s///; s/.*/\L&/' |grep -Eo '[[:alpha:]]{3,}' |sort |uniq -c |sort -nr |awk '$1==1{print $2}' |grep -xFvf /usr/share/dict/words |less If any of these are disputed or objectionable, I would summarily discard them, as I'm sure I missed some and fixing every last typo wasn't really the ghoul. Justin diff --git a/contrib/pgcrypto/rijndael.c b/contrib/pgcrypto/rijndael.c index 4c074ef..6938701 100644 --- a/contrib/pgcrypto/rijndael.c +++ b/contrib/pgcrypto/rijndael.c @@ -163,7 +163,7 @@ gen_tabs(void) q; /* log and power tables for GF(2**8) finite field with */ - /* 0x11b as modular polynomial - the simplest prmitive */ + /* 0x11b as modular polynomial - the simplest primitive */ /* root is 0x11, used here to generate the tables */ for (i = 0, p = 1; i < 256; ++i) diff --git a/src/backend/access/common/session.c b/src/backend/access/common/session.c index 617c3e1..ffa7432 100644 --- a/src/backend/access/common/session.c +++ b/src/backend/access/common/session.c @@ -60,7 +60,7 @@ InitializeSession(void) * Initialize the per-session DSM segment if it isn't already initialized, and * return its handle so that worker processes can attach to it. * - * Unlike the per-context DSM segment, this segement and its contents are + * Unlike the per-context DSM segment, this segment and its contents are * reused for future parallel queries. * * Return DSM_HANDLE_INVALID if a segment can't be allocated due to lack of diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c index e85abcf..4011199 100644 --- a/src/backend/access/nbtree/nbtinsert.c +++ b/src/backend/access/nbtree/nbtinsert.c @@ -187,9 +187,9 @@ top: _bt_relbuf(rel, buf); /* -* Something did not workout. Just forget about the cached +* Something did not work out. Just forget about the cached * block and follow the normal path. It might be set again if -* the conditions are favourble. +* the conditions are favourable. */ RelationSetTargetBlock(rel, InvalidBlockNumber); } diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c index 43a27a9..a3fb449 100644 --- a/src/backend/executor/execProcnode.c +++ b/src/backend/executor/execProcnode.c @@ -409,7 +409,7 @@ ExecSetExecProcNode(PlanState *node, ExecProcNodeMtd function) * Add a wrapper around the ExecProcNode callback that checks stack depth * during the first execution and maybe adds an instrumentation * wrapper. When the callback is changed after execution has already begun -* that means we'll superflously execute ExecProcNodeFirst, but that seems +* that means we'll superfluously execute ExecProcNodeFirst, but that seems * ok. 
*/ node->ExecProcNodeReal = function; diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c index 0d8c2bd..f37ff82 100644 --- a/src/backend/jit/llvm/llvmjit_expr.c +++ b/src/backend/jit/llvm/llvmjit_expr.c @@ -1768,7 +1768,7 @@ llvm_compile_expr(ExprState *state) b_compare_result, b_null); - /* build block analying the !NULL comparator result */ + /* build block analyzing the !NULL comparator result */ LLVMPositionBuilderAtEnd(b, b_compare_result); /* if results equal, compare next, otherwise done */ diff --git a/src/backend/optimizer/geqo/geqo_misc.c b/src/backend/optimizer/geqo/geqo_misc.c index 919d288..0f96912 100644 --- a/src/backend/optimizer/geqo/geqo_misc.c +++ b/src/backend/optimizer/geqo/geqo_misc.c @@ -92,7 +92,7 @@ print_gen(FILE *fp, Pool *pool, int generation) { int lowest; - /* Get index to lowest ranking gene in poplulation. */ + /* Get index to lowest ranking gene in population. */ /* Use 2nd to last since last is buffer. */ lowest = pool->size > 1 ? pool->size - 2 : 0; diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c index 60e2f21..e3c5d4
Re: bulk typos
On Sat, Mar 31, 2018 at 12:56 PM, Justin Pryzby wrote: > I needed another distraction so bulk-checked for typos, limited to > comments in > *.[ch]. > I think you introduced another one while changing "explcitly" to "expilcitly" instead of "explicitly" :-) -- Félix
Re: Online enabling of checksums
On Sat, Mar 31, 2018 at 2:08 AM, Tomas Vondra wrote: > Hi, > > I've been looking at the patch a bit more, and I think there are a > couple of fairly serious issues in the error handling. > Thanks! > > Firstly ChecksumHelperLauncherMain spends quite a bit of effort on > skipping dropped databases, but ChecksumHelperWorkerMain does not do the > same thing with tables. I'm not exactly sure why, but I'd say dropped > tables are more likely than dropped databases (e.g. because of temporary > tables) and it's strange to gracefully handle the more rare case. > Uh, yes. I could've sworn we had code for that, but I fully agree with your assessment that it's not there :) Now, when a table gets dropped after BuildRelationList() does it's work, > we end up calling ProcessSingleRelationByOid() on that OID. Which calls > relation_open(), which fails with elog(ERROR), terminating the whole > bgworker with an error like this: > > ERROR: could not open relation with OID 16632 > LOG: background worker "checksumhelper worker" (PID 27152) exited > with exit code 1 > Yeah. We need to trap that error an not crash and burn. Which however means the error handling in ChecksumHelperWorkerMain() has > no chance to kick in, because the bgworker dies right away. The code > looks like this: > > foreach(lc, RelationList) > { > ChecksumHelperRelation *rel > = (ChecksumHelperRelation *) lfirst(lc); > > if (!ProcessSingleRelationByOid(rel->reloid, strategy)) > { > ChecksumHelperShmem->success = ABORTED; > break; > } > else > ChecksumHelperShmem->success = SUCCESSFUL; > } > list_free_deep(RelationList); > > Now, assume the first relation in the list still exists and gets > processed correctly, so "success" ends up being SUCCESSFUL. Then the > second OID is the dropped relation, which kills the bgworker ... > Indeed, that's just a very simple bug. It shouldn't be set to SUCCESSFUL until *all* tables have been processed. I believe the code needs to be this: IMHO this error handling is broken by design - two things need to > happen, I think: (a) graceful handling of dropped relations and (b) > proper error reporting from the bgworder. > > (a) Should not be difficult to do, I think. We don't have relation_open > with a missing_ok flag, but implementing something like that should not > be difficult. Even a simple "does OID exist" should be enough. > Not entirely sure what you mean with "even a simple does oid exist" means? I mean, if we check for the file, that won't help us -- there will still be a tiny race between the check and us opening it won't it? However, we have try_relation_open(). Which is documented as: * Same as relation_open, except return NULL instead of failing * if the relation does not exist. So I'm pretty sure it's just a matter of using try_relation_open() instead, and checking for NULL? (b) But just handling dropped relations is not enough, because I could > simply kill the bgworker directly, and it would have exactly the same > consequences. What needs to happen is something like this: > And now I see your code, which was below-fold when I first read. After having writing a very similar fix myself. I'm glad this code looks mostly identical to what I suggested above, so I think that's a good solution. > BTW I don't think handling dropped relations by letting the bgworker > crash and restart is an acceptable approach. That would pretty much mean > any DDL changes are prohibited on the system while the checksum process > is running, which is not quite possible (e.g. 
for systems doing stuff > with temporary tables). > No, I don't like that at all. We need to handle them gracefully, by skipping them - crash and restart is not acceptable for something that common. Which however reminds me I've also ran into a bug in the automated retry > system, because you may get messages like this: > > ERROR: failed to enable checksums in "test", giving up (attempts > 639968292). > > This happens because BuildDatabaseList() does just palloc() and does not > initialize the 'attempts' field. It may get initialized to 0 by chance, > but I'm running with -DRANDOMIZE_ALLOCATED_MEMORY, hence the insanely > high value. > Eh. I don't have that "(attempts" part in my code at all. Is that either from some earlier version of the patch, or did you add that yourself for testing? BTW both ChecksumHelperRelation and ChecksumHelperDatabase have > 'success' field which is actually unused (and uninitialized). > Correct. These are old leftovers from the "partial restart" logic, and should be removed. But wait - there is more ;-) BuildRelationList is using heap_beginscan > with the regular snapshot, so it does not see uncommitted transactions. > So if you do this: > > BEGIN; > CREATE TABLE t AS SELECT i FROM generate_series(1,1000) s(i); > -- run pg_enable_data_checksums() fro
Re: Problem while setting the fpw with SIGHUP
On Wed, Mar 28, 2018 at 12:10 PM, Kyotaro HORIGUCHI wrote: > At Tue, 27 Mar 2018 22:02:26 +0900, Michael Paquier > wrote in <20180327130226.ga1...@paquier.xyz> >> On Tue, Mar 27, 2018 at 09:01:20PM +0900, Kyotaro HORIGUCHI wrote: >> > The current UpdateFullPageWrites is safe on standby and promotion >> > so what we should consider is only the non-standby case. I think >> > what we should do is just calling RecoveryInProgress() at the >> > beginning of CheckPointerMain, which is just the same thing with >> > InitPostgres, but before setting up signal handler to avoid >> > processing SIGHUP before being ready to insert xlog. >> >> Your proposal does not fix the issue for a checkpointer process started >> on a standby. After a promotion, if SIGHUP is issued with a change in >> full_page_writes, then the initialization of InitXLogInsert() would >> happen again in the critical section of UpdateFullPageWrites(). The >> window is rather small for normal promotions as the startup process >> requests a checkpoint which would do the initialization, and much larger >> for fallback_promote where the startup process is in charge of doing the >> end-of-recovery checkpoint. > > Yeah. I realized that after sending the mail. > > I looked closer and found several problems there. > > - On standby, StartupXLOG calls UpdateFullPageWrites and > checkpointer can call the same function simultaneously, but it > doesn't assume concurrent call. > > - StartupXLOG can make a concurrent write to > Insert->fullPageWrite so it needs to be locked. > > - At the time of the very end of recovery, the startup process > ignores possible change of full_page_writes GUC. It sticks with > the startup value. It leads to loss of > XLOG_CHANGE_FPW. (StartXLOG is not considering GUC changes by > reload) > Won't this be covered by checkpointer process? Basically, the next time checkpointer sees that it has received the sighup, it will call UpdateFullPageWrites which will log the record if required. In general, I was wondering why in the first place this variable (full_page_writes) is a SIGHUP variable? I think if the user tries to switch it to 'on' from 'off', it won't guarantee the recovery from torn pages. Yeah, one can turn it to 'off' from 'on' without any problem, but as the reverse doesn't guarantee anything, it can confuse users. What do you think? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
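For reference, the reload-time change being discussed is simply the following (a minimal illustration; whether switching it back to 'on' at reload time should be allowed at all is the question raised above):

    ALTER SYSTEM SET full_page_writes = off;
    SELECT pg_reload_conf();  -- full_page_writes is a SIGHUP GUC, so the
                              -- checkpointer acts on the change at reload time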
Re: [HACKERS][PATCH] adding simple sock check for windows
Thanks Amit. I had a mistake. Thank you again to point it out :) 2018-03-31 19:33 GMT+09:00 Amit Kapila : > On Sat, Mar 31, 2018 at 12:05 PM, CharSyam wrote: >> Amit, I agree with you. >> >> I changed my patch :) to remove old check. >> > > - if (slot->sock < 0) > + if (slot->sock == PGINVALID_SOCKET || slot->sock < 0) > > I still see the same check. I think by mistake you have attached old patch. > > -- > With Regards, > Amit Kapila. > EnterpriseDB: http://www.enterprisedb.com simple_check.patch Description: Binary data
Re: [PATCH] Verify Checksums during Basebackups
Hi, On Fri, Mar 30, 2018 at 07:46:02AM -0400, Stephen Frost wrote: > * Magnus Hagander (mag...@hagander.net) wrote: > > On Fri, Mar 30, 2018 at 5:35 AM, David Steele wrote: > > > > > On 3/24/18 10:32 AM, Michael Banck wrote: > > > > Am Freitag, den 23.03.2018, 17:43 +0100 schrieb Michael Banck: > > > >> Am Freitag, den 23.03.2018, 10:54 -0400 schrieb David Steele: > > > >>> In my experience actual block errors are relatively rare, so there > > > >>> aren't likely to be more than a few in a file. More common are > > > >>> overwritten or transposed files, rogue files, etc. These produce a > > > >>> lot > > > >>> of output. > > > >>> > > > >>> Maybe stop after five? > > > > > > > > The attached patch does that, and outputs the total number of > > > > verification failures of that file after it got sent. > > > > > > > >> I'm on board with this, but I have the feeling that this is not a very > > > >> common pattern in Postgres, or might not be project style at all. I > > > >> can't remember even seen an error message like that. > > > >> > > > >> Anybody know whether we're doing this in a similar fashion elsewhere? > > > > > > > > I tried to have look around and couldn't find any examples, so I'm not > > > > sure that patch should go in. On the other hand, we abort on checksum > > > > failures usually (in pg_dump e.g.), so limiting the number of warnings > > > > does makes sense. > > > > > > > > I guess we need to see what others think. > > > > > > Well, at this point I would say silence more or less gives consent. > > > > > > Can you provide a rebased patch with the validation retry and warning > > > limiting logic added? I would like to take another pass through it but I > > > think this is getting close. > > > > I was meaning to mention it, but ran out of cycles. > > > > I think this is the right way to do it, except the 5 should be a #define > > and not an inline hardcoded value :) We could argue whether it should be "5 > > total" or "5 per file". When I read the emails I thought it was going to be > > 5 total, but I see the implementation does 5 / file. In a super-damanged > > system that will still lead to horrible amounts of logging, but I think > > maybe if your system is in that bad shape, then it's a lost cause anyway. > > 5/file seems reasonable to me as well. > > > I also think the "total number of checksum errors" should be logged if > > they're >0, not >5. And I think *that* one should be logged at the end of > > the entire process, not per file. That'd be the kind of output that would > > be the most interesting, I think (e.g. if I have it spread out with 1 block > > each across 4 files, I want that logged at the end because it's easy to > > otherwise miss one or two of them that may have happened a long time apart). > > I definitely like having a total # of checksum errors included at the > end, if there are any at all. When someone is looking to see why the > process returned a non-zero exit code, they're likely to start looking > at the end of the log, so having that easily available and clear as to > why the backup failed is definitely valuable. > > > I don't think we have a good comparison elsewhere, and that is as David > > mention because other codepaths fail hard when they run into something like > > that. And we explicitly want to *not* fail hard, per previous discussion. > > Agreed. Attached is a new and rebased patch which does the above, plus integrates the suggested changes by David Steele. 
The output is now: $ initdb -k --pgdata=data1 1> /dev/null 2> /dev/null $ pg_ctl --pgdata=data1 --log=pg1.log start > /dev/null $ dd conv=notrunc oflag=seek_bytes seek=4000 bs=8 count=1 if=/dev/zero of=data1/base/12374/2610 2> /dev/null $ for i in 4000 13000 21000 29000 37000 43000; do dd conv=notrunc oflag=seek_bytes seek=$i bs=8 count=1 if=/dev/zero of=data1/base/12374/1259; done 2> /dev/null $ pg_basebackup -v -h /tmp --pgdata=data2 pg_basebackup: initiating base backup, waiting for checkpoint to complete pg_basebackup: checkpoint completed pg_basebackup: write-ahead log start point: 0/260 on timeline 1 pg_basebackup: starting background WAL receiver pg_basebackup: created temporary replication slot "pg_basebackup_13882" WARNING: checksum verification failed in file "./base/12374/2610", block 0: calculated C2C9 but expected EC78 WARNING: checksum verification failed in file "./base/12374/1259", block 0: calculated 8BAE but expected 46B8 WARNING: checksum verification failed in file "./base/12374/1259", block 1: calculated E413 but expected 7701 WARNING: checksum verification failed in file "./base/12374/1259", block 2: calculated 5DA9 but expected D5AA WARNING: checksum verification failed in file "./base/12374/1259", block 3: calculated 5651 but expected 4F5E WARNING: checksum verification failed in file "./base/12374/1259", block 4: calculated DF39 but expected DF00 WARNING: further checksum verification failures in file "./base/12374/1259" will not be
Re: [HACKERS][PATCH] adding simple sock check for windows
On Sat, Mar 31, 2018 at 6:08 PM, CharSyam wrote: > Thanks Amit. > I had a mistake. Thank you again to point it out :) > Thanks, your new patch looks good. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: Foreign keys and partitioned tables
On 3/29/18 23:19, Alvaro Herrera wrote: > 0001 prohibits having foreign keys pointing to partitioned tables, as > discussed above. This is already prohibited. You get an error ERROR: cannot reference partitioned table "fk_partitioned_pk" Your patch 0001 just adds the same error check a few lines above the existing one, while 0003 removes the existing one. I think you can squash 0001 and 0003 together. > 0002 is a fixup for a bug in the row triggers patch: I had a restriction > earlier that triggers declared internal were not cloned, and I seem to > have lost it in rebase. Reinstate it. Hmm, doesn't cause any test changes? -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: hot_standby_feedback vs excludeVacuum and snapshots
On Thu, Mar 29, 2018 at 4:47 PM, Greg Stark wrote: > I'm poking around to see debug a vacuuming problem and wondering if > I've found something more serious. > > As far as I can tell the snapshots on HOT standby are built using a > list of running xids that the primary builds and puts in the WAL and > that seems to include all xids from transactions running in all > databases. The HOT standby would then build a snapshot and eventually > send the xmin of that snapshot back to the primary in the hot standby > feedback and that would block vacuuming tuples that might be visible > to the standby. > > Many ages ago Alvaro sweated blood to ensure vacuums could run for > long periods of time without holding back the xmin horizon and > blocking other vacuums from cleaning up tuples. That's the purpose of > the excludeVacuum flag in GetCurrentVirtualXIDs(). That's possible > because we know vacuums won't insert any tuples that queries might try > to view and also vacuums won't try to perform any sql queries on other > tables. > > I can't find anywhere that the standby snapshot building mechanism > gets this same information about which xids are actually vacuums that > can be ignored when building a snapshot. > I think the vacuum assigns xids only if it needs to truncate some of the pages in the relation which happens towards the end of vacuum. So, it shouldn't hold back the xmin horizon for long. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: WIP: a way forward on bootstrap data
Attached is v13, rebased against b0c90c85fc. Patch 0001: -The data files are formatted to at most 80 columns wide. -Rename rewrite_dat.pl to reformat_dat_file.pl. -Refactor Catalog.pm and reformat_dat_file.pl to have better separation of concerns. -Add src/include/catalog/Makefile with convenience targets for rewriting data files. Patch 0002: -Some adjustments to the post-conversion cleanup of data files. Patch 0005: -I made a stub version of Solution.pm to simulate testing the MSVC build. This found one bug, and also allowed me to bring in some of the more pedantic dependencies I added to utils/Makefile. Patch 0009: -New patch that puts all doc changes in one patch, for flexibility. -Split the parts of catalog/README having to do with data into a new README.data file. Add recipes for how to edit data, with code examples. On 3/26/18, Tom Lane wrote: > It would be more work, but maybe we should move this into the main > SGML docs. It seems rather silly to have SGML documentation for the > .BKI file format, which now will be an internal matter that hardly > any developers need worry about, but not for the .DAT file format. > But I understand if that seems a bridge too far for today --- certainly > a README file is way better than nothing. I had an idea to make it less silly without doing as much work: Get rid of the SGML docs for the BKI format, and turn them into bootstrap/README. Thoughts? And in the department of second thoughts, it occurred to me that the only reason that the .dat files are in include/catalog is because that's where the DATA() statements were. Since they are separate now, one could make the case that they actually belong in backend/catalog. One trivial advantage here is that there is already an existing Makefile in which to put convenience targets for formatting. On the other hand, it kind of makes sense to have the files describing the schema (.h) and the contents (.dat) in the same directory. I'm inclined to leave things as they are for that reason. -John Naylor v13-bootstrap-data-conversion.tar.gz Description: GNU Zip compressed data
Re: Small proposal to improve out-of-memory messages
Tom Lane wrote: > In the wake of commit 442accc3f one might think that the message should > also include the context "identifier" if any. But I'm specifically not > proposing that, because of the point made in the discussion of that patch > that some identifier strings might contain security-sensitive query > excerpts. Now that that commit has required all context "names" to be > compile-time constants, it's hard to see how there could be any security > downside to showing them in a user-visible message. How about using errdetail_log() to include the context identifier? -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Feature Request - DDL deployment with logical replication
On 31 March 2018 at 15:53, Nikhil Sontakke wrote: > Hi Jeremy, > > > My whole point is that in most architectures, DBAs decide to deploy the > same > > SQL on providers and subscribers. Yes it isn't perfect, but IMO, it is > very > > helpful to try to automate that idea, as opposed to trying to actually > > replicate DDL at the low level. The latter is better, yes, but seems to > > have proven extremely difficult. Hence, why you see the advent of > functions > > to pipe DDL through the replication stream. > > > > The community is currently working on in the current commitfest to try > and get logical decoding of 2PC in into the core. > > Once something like that gets in, for a majority of subset of DDLs > (which works inside transaction blocks), one of the use cases of that > functionality could be to trap these DDLs and convert them into > implicit 2PC commands. Details need to be worked out, but we would get > all the logical replication cluster nodes in sync with each other and > issue a PREPARE transaction involving this DDL on all nodes in the > logical replication cluster. If any of the nodes is not able to > successfully prepare this DDL, then we can rollback or else commit the > 2PC, thus moving the entire logical cluster consistently in terms of > schema changes. > > We'll still need a mechanism to transport them to downstreams (like WAL messages) and to send responses upstream. For responses I think we will finally want to add a backchannel to the logical replication protocol as I've wanted for a long while: downstream can send a COPY message on COPY BOTH proto back to upstream, which passes it to a callback on the output plugin for the output plugin to act on. The main issue I had when I tried to prototype this before was IIRC not knowing how to set up the right snapshot in which to execute the callback. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Online enabling of checksums
Hi, On 03/31/2018 02:02 PM, Magnus Hagander wrote: > On Sat, Mar 31, 2018 at 2:08 AM, Tomas Vondra > mailto:tomas.von...@2ndquadrant.com>> wrote: > > ... > > (a) Should not be difficult to do, I think. We don't have relation_open > with a missing_ok flag, but implementing something like that should not > be difficult. Even a simple "does OID exist" should be enough. > > > Not entirely sure what you mean with "even a simple does oid exist" > means? I mean, if we check for the file, that won't help us -- there > will still be a tiny race between the check and us opening it won't it? > I meant to say "even a simple check if the OID still exists" but it was a bit too late / not enough caffeine issue. You're right there would be a tiny window of race condition - it'd be much shorter, possibly enough to make the error+restart approach acceptable. > However, we have try_relation_open(). Which is documented as: > *Same as relation_open, except return NULL instead of failing > *if the relation does not exist. > > So I'm pretty sure it's just a matter of using try_relation_open() > instead, and checking for NULL? > Oh, right. I thought we had a relation_open variant that handles this, but have been looking for one with missing_ok flag and so I missed this. try_relation_open should do the trick when it comes to dropped tables. > > (b) But just handling dropped relations is not enough, because I could > simply kill the bgworker directly, and it would have exactly the same > consequences. What needs to happen is something like this: > > > > And now I see your code, which was below-fold when I first read. After > having writing a very similar fix myself. I'm glad this code looks > mostly identical to what I suggested above, so I think that's a good > solution. > Good ;-) > > > BTW I don't think handling dropped relations by letting the bgworker > crash and restart is an acceptable approach. That would pretty much mean > any DDL changes are prohibited on the system while the checksum process > is running, which is not quite possible (e.g. for systems doing stuff > with temporary tables). > > > No, I don't like that at all. We need to handle them gracefully, by > skipping them - crash and restart is not acceptable for something that > common. > Yeah, I was just thinking aloud. > > Which however reminds me I've also ran into a bug in the automated retry > system, because you may get messages like this: > > ERROR: failed to enable checksums in "test", giving up (attempts > 639968292). > > This happens because BuildDatabaseList() does just palloc() and does not > initialize the 'attempts' field. It may get initialized to 0 by chance, > but I'm running with -DRANDOMIZE_ALLOCATED_MEMORY, hence the insanely > high value. > > > Eh. I don't have that "(attempts" part in my code at all. Is that either > from some earlier version of the patch, or did you add that yourself for > testing? > Apologies, you're right I tweaked the message a bit (just adding the number of attempts to it). The logic however remains the same, and the bug is real. > > But wait - there is more ;-) BuildRelationList is using heap_beginscan > with the regular snapshot, so it does not see uncommitted transactions. > So if you do this: > > BEGIN; > CREATE TABLE t AS SELECT i FROM generate_series(1,1000) s(i); > -- run pg_enable_data_checksums() from another session > SELECT COUNT(*) FROM t; > > then the table will be invisible to the checksum worker, it won't have > checksums updated and the cluster will get checksums enabled. 
Which > means this: > > > Ugh. Interestingly enough I just put that on my TODO list *yesterday* > that I forgot to check that specific case :/ > But I was faster in reporting it ;-) > > test=# SELECT COUNT(*) FROM t; > WARNING: page verification failed, calculated checksum 27170 but > expected 0 > ERROR: invalid page in block 0 of relation base/16677/16683 > > Not sure what's the best way to fix this - maybe we could wait for all > running transactions to end, before starting the work. > > > I was thinking of that as one way to deal with it, yes. > > I guess a reasonable way to do that would be to do it as part of > BuildRelationList() -- basically have that one wait until there is no > other running transaction in that specific database before we take the > snapshot? > > A first thought I had was to try to just take an access exclusive lock > on pg_class for a very short time, but a transaction that does create > table doesn't actually keep it's lock on that table so there is no conflict. > Yeah, I don't think that's going to work, because you don't even know you need to lock/wait for something. I do think just waiting for all running transactions to complete is fine, and it's not the first place wher
Re: Parallel Aggregates for string_agg and array_agg
On 30 March 2018 at 02:55, Tomas Vondra wrote: > On 03/29/2018 03:09 PM, David Rowley wrote: >> I meant to mention earlier that I coded >> agg_args_have_sendreceive_funcs() to only check for send/receive >> functions. Really we could allow a byval types without send/receive >> functions, since the serial/deserial just send the raw datums in that >> case, but then the function becomes >> agg_byref_args_have_sendreceive_funcs(), which seemed a bit obscure, >> so I didn't do that. Maybe I should? > > I'd do that. Not sure the function name needs to change, but perhaps > agg_args_support_sendreceive() would be better - it covers both byref > types (which require send/receive functions) and byval (which don't). The attached patch implements this. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services combinefn_for_string_and_array_aggs_v10.patch Description: Binary data
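For anyone following along, the byval/byref distinction is visible directly in the catalog; byref argument types such as text need typsend/typreceive to be serialized for parallel workers, while byval types such as int4 can be sent as raw datums (illustrative query):

    SELECT typname, typbyval, typsend, typreceive
      FROM pg_type
     WHERE typname IN ('int4', 'text');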
Re: Online enabling of checksums
On Sat, Mar 31, 2018 at 4:21 PM, Tomas Vondra wrote: > > On 03/31/2018 02:02 PM, Magnus Hagander wrote: > > On Sat, Mar 31, 2018 at 2:08 AM, Tomas Vondra > > mailto:tomas.von...@2ndquadrant.com>> > wrote: > > > > > > But wait - there is more ;-) BuildRelationList is using > heap_beginscan > > with the regular snapshot, so it does not see uncommitted > transactions. > > So if you do this: > > > > BEGIN; > > CREATE TABLE t AS SELECT i FROM generate_series(1,1000) s(i); > > -- run pg_enable_data_checksums() from another session > > SELECT COUNT(*) FROM t; > > > > then the table will be invisible to the checksum worker, it won't > have > > checksums updated and the cluster will get checksums enabled. Which > > means this: > > > > > > Ugh. Interestingly enough I just put that on my TODO list *yesterday* > > that I forgot to check that specific case :/ > > > > But I was faster in reporting it ;-) > Indeed you were :) > > test=# SELECT COUNT(*) FROM t; > > WARNING: page verification failed, calculated checksum 27170 but > > expected 0 > > ERROR: invalid page in block 0 of relation base/16677/16683 > > > > Not sure what's the best way to fix this - maybe we could wait for > all > > running transactions to end, before starting the work. > > > > > > I was thinking of that as one way to deal with it, yes. > > > > I guess a reasonable way to do that would be to do it as part of > > BuildRelationList() -- basically have that one wait until there is no > > other running transaction in that specific database before we take the > > snapshot? > > > > A first thought I had was to try to just take an access exclusive lock > > on pg_class for a very short time, but a transaction that does create > > table doesn't actually keep it's lock on that table so there is no > conflict. > > > > Yeah, I don't think that's going to work, because you don't even know > you need to lock/wait for something. > I do think just waiting for all running transactions to complete is > fine, and it's not the first place where we use it - CREATE SUBSCRIPTION > does pretty much exactly the same thing (and CREATE INDEX CONCURRENTLY > too, to some extent). So we have a precedent / working code we can copy. > Thinking again, I don't think it should be done as part of BuildRelationList(). We should just do it once in the launcher before starting, that'll be both easier and cleaner. Anything started after that will have checksums on it, so we should be fine. PFA one that does this. > And if you try this with a temporary table (not hidden in transaction, > > so the bgworker can see it), the worker will fail with this: > > > > ERROR: cannot access temporary tables of other sessions > > > > But of course, this is just another way how to crash without updating > > the result for the launcher, so checksums may end up being enabled > > anyway. > > > > > > Yeah, there will be plenty of side-effect issues from that > > crash-with-wrong-status case. Fixing that will at least make things > > safer -- in that checksums won't be enabled when not put on all pages. > > > > Sure, the outcome with checksums enabled incorrectly is a consequence of > bogus status, and fixing that will prevent that. But that wasn't my main > point here - not articulated very clearly, though. > > The bigger question is how to handle temporary tables gracefully, so > that it does not terminate the bgworker like this at all. 
This might be > even bigger issue than dropped relations, considering that temporary > tables are pretty common part of applications (and it also includes > CREATE/DROP). > > For some clusters it might mean the online checksum enabling would > crash+restart infinitely (well, until reaching MAX_ATTEMPTS). > > Unfortunately, try_relation_open() won't fix this, as the error comes > from ReadBufferExtended. And it's not a matter of simply creating a > ReadBuffer variant without that error check, because temporary tables > use local buffers. > > I wonder if we could just go and set the checksums anyway, ignoring the > local buffers. If the other session does some changes, it'll overwrite > our changes, this time with the correct checksums. But it seems pretty > dangerous (I mean, what if they're writing stuff while we're updating > the checksums? Considering the various short-cuts for temporary tables, > I suspect that would be a boon for race conditions.) > > Another option would be to do something similar to running transactions, > i.e. wait until all temporary tables (that we've seen at the beginning) > disappear. But we're starting to wait on more and more stuff. > > If we do this, we should clearly log which backends we're waiting for, > so that the admins can go and interrupt them manually. > Yeah, waiting for all transactions at the beginning is pretty simple. Making the worker simply ignore temporary tables would also be easy. One of
Re: Small proposal to improve out-of-memory messages
Alvaro Herrera writes: > Tom Lane wrote: >> In the wake of commit 442accc3f one might think that the message should >> also include the context "identifier" if any. But I'm specifically not >> proposing that, because of the point made in the discussion of that patch >> that some identifier strings might contain security-sensitive query >> excerpts. Now that that commit has required all context "names" to be >> compile-time constants, it's hard to see how there could be any security >> downside to showing them in a user-visible message. > How about using errdetail_log() to include the context identifier? Not really excited about that; we have no field experience to say it'd be useful. If we start finding ourselves asking "exactly which ExprState was that", we could revisit the question perhaps. Furthermore, because elog.c constructs the whole detail string in ErrorContext, doing this would create a significant OOM hazard anytime the context ID didn't fit into the preallocated ErrorContext space. Context names are generally short enough that they're not adding to our risk there, but the ID strings could be very long in some cases. (mcxt.c avoids this hazard by writing directly to stderr, but elog.c can't do that.) regards, tom lane
Re: csv format for psql
On 31 March 2018 at 04:33, Fabien COELHO wrote: > > Bonjour Daniel, > > For csv, Fabien and Peter expressed the opinion that we shouldn't >> create another fieldsep-like variable specifically for it, but instead >> reuse fieldsep. That's what my latest patch does. >> >> Now it turns out that sharing fieldsep comes with some problems. >> > > Personnaly I do not have any problem with CSV defaulting to '|' separator, > given that anyway people often use anything but a comma for the purpose, > including '|'. > > However Pavel wants to block the patch on this point. Too bad. > OK, mostly trying to avoid commenting because I doubt I have much to add. But. If I ask for CSV and don't specify any overrides, I expect to get "C"omma separated values, not some other character. More specifically, if I say --csv I expect to get files that are identical with what I would get if I used COPY ... CSV. Actually, COPY ... CSV HEADER, given that psql shows column headings. This also implies that I expect the quoting and related details that are associated with CSV. And I don't think I'm a weird user. If --csv does anything even a little different from a simple COPY invocation on the same query, at some point it's going to bite somebody and they will rightfully curse the design decisions taken in this thread.
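In other words, the expectation is that --csv output for a query should match what the server-side equivalent produces, quoting rules and all (query here is illustrative):

    COPY (SELECT oid, relname FROM pg_class ORDER BY oid LIMIT 3)
        TO STDOUT WITH (FORMAT csv, HEADER true);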
Re: WIP: a way forward on bootstrap data
John Naylor writes: > And in the department of second thoughts, it occurred to me that the > only reason that the .dat files are in include/catalog is because > that's where the DATA() statements were. Since they are separate now, > one could make the case that they actually belong in backend/catalog. > One trivial advantage here is that there is already an existing > Makefile in which to put convenience targets for formatting. On the > other hand, it kind of makes sense to have the files describing the > schema (.h) and the contents (.dat) in the same directory. I'm > inclined to leave things as they are for that reason. Yeah. The fact that, eg, both the .h and .dat files are inputs to duplicate_oids and unused_oids makes me think it's better to keep them together. I'd actually been thinking of something that's about the reverse: instead of building the derived .h files in backend/catalog and then symlinking them into include/catalog, it'd be saner to build them in include/catalog to begin with. However, that would mean that the Perl scripts need to produce output in two different places, so maybe it'd end up more complicated not less so. In any case, that seems like something to leave for another day. regards, tom lane
Re: Online enabling of checksums
On 03/31/2018 05:05 PM, Magnus Hagander wrote: > On Sat, Mar 31, 2018 at 4:21 PM, Tomas Vondra > mailto:tomas.von...@2ndquadrant.com>> wrote: > > ... > > I do think just waiting for all running transactions to complete is > fine, and it's not the first place where we use it - CREATE SUBSCRIPTION > does pretty much exactly the same thing (and CREATE INDEX CONCURRENTLY > too, to some extent). So we have a precedent / working code we can copy. > > > Thinking again, I don't think it should be done as part of > BuildRelationList(). We should just do it once in the launcher before > starting, that'll be both easier and cleaner. Anything started after > that will have checksums on it, so we should be fine. > > PFA one that does this. > Seems fine to me. I'd however log waitforxid, not the oldest one. If you're a DBA and you want to make the checksumming to proceed, knowing the oldest running XID is useless for that. If we log waitforxid, it can be used to query pg_stat_activity and interrupt the sessions somehow. > > > And if you try this with a temporary table (not hidden in > transaction, > > so the bgworker can see it), the worker will fail with this: > > > > ERROR: cannot access temporary tables of other sessions > > > > But of course, this is just another way how to crash without > updating > > the result for the launcher, so checksums may end up being enabled > > anyway. > > > > > > Yeah, there will be plenty of side-effect issues from that > > crash-with-wrong-status case. Fixing that will at least make things > > safer -- in that checksums won't be enabled when not put on all pages. > > > > Sure, the outcome with checksums enabled incorrectly is a consequence of > bogus status, and fixing that will prevent that. But that wasn't my main > point here - not articulated very clearly, though. > > The bigger question is how to handle temporary tables gracefully, so > that it does not terminate the bgworker like this at all. This might be > even bigger issue than dropped relations, considering that temporary > tables are pretty common part of applications (and it also includes > CREATE/DROP). > > For some clusters it might mean the online checksum enabling would > crash+restart infinitely (well, until reaching MAX_ATTEMPTS). > > Unfortunately, try_relation_open() won't fix this, as the error comes > from ReadBufferExtended. And it's not a matter of simply creating a > ReadBuffer variant without that error check, because temporary tables > use local buffers. > > I wonder if we could just go and set the checksums anyway, ignoring the > local buffers. If the other session does some changes, it'll overwrite > our changes, this time with the correct checksums. But it seems pretty > dangerous (I mean, what if they're writing stuff while we're updating > the checksums? Considering the various short-cuts for temporary tables, > I suspect that would be a boon for race conditions.) > > Another option would be to do something similar to running transactions, > i.e. wait until all temporary tables (that we've seen at the beginning) > disappear. But we're starting to wait on more and more stuff. > > If we do this, we should clearly log which backends we're waiting for, > so that the admins can go and interrupt them manually. > > > > Yeah, waiting for all transactions at the beginning is pretty simple. > > Making the worker simply ignore temporary tables would also be easy. 
> > One of the bigger issues here is temporary tables are *session* scope > and not transaction, so we'd actually need the other session to finish, > not just the transaction. > > I guess what we could do is something like this: > > 1. Don't process temporary tables in the checksumworker, period. > Instead, build a list of any temporary tables that existed when the > worker started in this particular database (basically anything that we > got in our scan). Once we have processed the complete database, keep > re-scanning pg_class until those particular tables are gone (search by oid). > > That means that any temporary tables that are created *while* we are > processing a database are ignored, but they should already be receiving > checksums. > > It definitely leads to a potential issue with long running temp tables. > But as long as we look at the *actual tables* (by oid), we should be > able to handle long-running sessions once they have dropped their temp > tables. > > Does that sound workable to you? > Yes, that's pretty much what I meant by 'wait until all temporary tables disappear'. Again, we need to make it easy to determine which OIDs are we waiting for, which sessions may need DBA's attention. I don't think it makes sense to log OIDs of the temporary tables. There can be many of them, and in most cases the connection/sess
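As an illustration of the "log waitforxid" suggestion above, a DBA could then find and, if needed, interrupt the offending sessions with something like this (sketch only):

    SELECT pid, backend_xid, backend_xmin, state, query
      FROM pg_stat_activity
     WHERE backend_xid IS NOT NULL OR backend_xmin IS NOT NULL;

    -- and, as a last resort, interrupt a blocker:
    -- SELECT pg_terminate_backend(<pid>);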
Re: [HACKERS] Runtime Partition Pruning
David Rowley wrote: > To put the new patch to the test, I tried pgbench -S -M prepared -s > 100 with and without having modified pgbench_accounts to separate into > 10 RANGE partitions of equal size. > > A non-partitioned table was getting 12503 TPS. > With partitioned tables, the old version of this patch was getting: 5470 TPS. > With partitioned tables, the attached version gets 11247 TPS. > For perspective, today's master with a partitioned table gets 4719 TPS. > > So you can see it's a pretty good performance boost by skipping > initialisation of the 9 non-matching subplans. It's not hard to > imagine the gains getting more significant with a larger number of > partitions. This is excellent news! -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
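The query pattern exercised by pgbench -S -M prepared is essentially the following; the win comes from initializing only the subplan of the one partition that can match $1 (sketch):

    PREPARE account_lookup(int) AS
        SELECT abalance FROM pgbench_accounts WHERE aid = $1;
    EXECUTE account_lookup(12345);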
Re: [PROPOSAL] Shared Ispell dictionaries
On 03/31/2018 12:42 PM, Arthur Zakirov wrote: > Hello all, > > I'd like to add new optional function to text search template named fini > in addition to init() and lexize(). It will be called by > RemoveTSDictionaryById() and AlterTSDictionary(). dispell_fini() will > call ts_dict_shmem_release(). > > It doesn't change segments leaking situation. I think it makes text > search API more transparent. > If it doesn't actually solve the problem, why add it? I don't see a point in adding functions for the sake of transparency, when it does not in fact serve any use cases. Can't we handle the segment-leaking by adding some sort of tombstone? For example, imagine that instead of removing the hash table entry we mark it as 'dropped'. And after that, after the lookup we would know the dictionary was removed, and the backends would load the dictionary into their private memory. Of course, this could mean we end up having many tombstones in the hash table. But those tombstones would be tiny, making it less painful than potentially leaking much more memory for the dictionaries. Also, I wonder if we might actually remove the dictionaries after a while, e.g. based on XID. Imagine that we note the XID of the transaction removing the dictionary, or perhaps XID of the most recent running transaction. Then we could use this to decide if all running transactions actually see the DROP, and we could remove the tombstone. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
On 31 March 2018 at 21:24, Anthony Iliopoulos wrote: > On Fri, Mar 30, 2018 at 10:18:14AM +1300, Thomas Munro wrote: > > > >> Yeah, I see why you want to PANIC. > > > > > > Indeed. Even doing that leaves question marks about all the kernel > > > versions before v4.13, which at this point is pretty much everything > > > out there, not even detecting this reliably. This is messy. > > There may still be a way to reliably detect this on older kernel > versions from userspace, but it will be messy whatsoever. On EIO > errors, the kernel will not restore the dirty page flags, but it > will flip the error flags on the failed pages. One could mmap() > the file in question, obtain the PFNs (via /proc/pid/pagemap) > and enumerate those to match the ones with the error flag switched > on (via /proc/kpageflags). This could serve at least as a detection > mechanism, but one could also further use this info to logically > map the pages that failed IO back to the original file offsets, > and potentially retry IO just for those file ranges that cover > the failed pages. Just an idea, not tested. > That sounds like a huge amount of complexity, with uncertainty as to how it'll behave kernel-to-kernel, for negligble benefit. I was exploring the idea of doing selective recovery of one relfilenode, based on the assumption that we know the filenode related to the fd that failed to fsync(). We could redo only WAL on that relation. But it fails the same test: it's too complex for a niche case that shouldn't happen in the first place, so it'll probably have bugs, or grow bugs in bitrot over time. Remember, if you're on ext4 with errors=remount-ro, you get shut down even harder than a PANIC. So we should just use the big hammer here. I'll send a patch this week. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Craig Ringer writes: > So we should just use the big hammer here. And bitch, loudly and publicly, about how broken this kernel behavior is. If we make enough of a stink maybe it'll get fixed. regards, tom lane
Re: Feature Request - DDL deployment with logical replication
On 2018-03-31 22:13:42 +0800, Craig Ringer wrote: > We'll still need a mechanism to transport them to downstreams (like WAL > messages) and to send responses upstream. For responses I think we will > finally want to add a backchannel to the logical replication protocol as > I've wanted for a long while: downstream can send a COPY message on COPY > BOTH proto back to upstream, which passes it to a callback on the output > plugin for the output plugin to act on. Not necessarily? You can just send out the prepare, wait for all clients to ack it, and then commit/rollback prepared. Greetings, Andres Freund
Re: Foreign keys and partitioned tables
On 3/29/18 23:19, Alvaro Herrera wrote: > 0003 is the matter of interest. This is essentially the same code I > posted earlier, rebased to the committed row triggers patch, with a few > minor bug fixes and some changes in the regression tests to try and make > them more comprehensive, including leaving a partitioned table with an > FK to test that the whole pg_dump thing works via the pg_upgrade test. I've only read the tests so far. The functionality appears to work correctly. It's a bit strange how the tests are split up between alter_table.sql and foreign_key.sql, especially since the latter also contains ALTER TABLE checks and vice versa. Some tests are a bit redundant, e.g., this in alter_table.sql: +-- these fail: +INSERT INTO at_partitioned VALUES (1000, 42); +ERROR: insert or update on table "at_partitioned_0" violates foreign key constraint "at_partitioned_reg1_col1_fkey" +DETAIL: Key (reg1_col1)=(42) is not present in table "at_regular1". and +INSERT INTO at_partitioned VALUES (5000, 42); +ERROR: insert or update on table "at_partitioned_0" violates foreign key constraint "at_partitioned_reg1_col1_fkey" +DETAIL: Key (reg1_col1)=(42) is not present in table "at_regular1". There are no documentation changes. The foreign key section in CREATE TABLE does not contain anything about partitioned tables, which is probably an existing omission, but it might be good to fix this up now. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
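A condensed version of what these tests exercise, with the patch applied (table and column names are taken from the test output above; the rest is a sketch):

    CREATE TABLE at_regular1 (col1 int PRIMARY KEY);
    CREATE TABLE at_partitioned (id int, reg1_col1 int) PARTITION BY RANGE (id);
    ALTER TABLE at_partitioned
        ADD FOREIGN KEY (reg1_col1) REFERENCES at_regular1 (col1);
    CREATE TABLE at_partitioned_0 PARTITION OF at_partitioned
        FOR VALUES FROM (0) TO (10000);
    INSERT INTO at_partitioned VALUES (1000, 42);  -- rejected: 42 is not in at_regular1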
Re: [HACKERS][PATCH] adding simple sock check for windows
CharSyam writes: > [ simple_check.patch ] This is a good catch. However, it looks to me like the reason nobody has noticed a problem here is that actually, this error test is useless; we can never get here with a bad connection, because connectDatabase would have failed. I'm inclined to think we should just delete the whole if-statement, instead. BTW, I tried cross-checking for other errors of this ilk by doing diff --git a/src/include/port.h b/src/include/port.h index a514ab758b..9ba6a0df79 100644 --- a/src/include/port.h +++ b/src/include/port.h @@ -28,9 +28,9 @@ /* socket has a different definition on WIN32 */ #ifndef WIN32 -typedef int pgsocket; +typedef unsigned int pgsocket; -#define PGINVALID_SOCKET (-1) +#define PGINVALID_SOCKET ((pgsocket) -1) #else typedef SOCKET pgsocket; and then compiling with a compiler that would warn about "unsigned int < 0" tests (I used Apple's current compiler, but probably any recent clang would do as well). It found this case and no others, which is good. This is not a 100% cross-check, because I don't think it would notice "unsigned int == -1" which is another way somebody might misspell the test, and it definitely would not catch cases where somebody stored a socket identifier in plain int instead of pgsocket. But it's better than nothing... regards, tom lane
Re: [HACKERS][PATCH] adding simple sock check for windows
Hi, Tom, Thank you for your review. So, do you think it is better to remove the if statement like below? diff --git src/bin/scripts/vacuumdb.c src/bin/scripts/vacuumdb.c index 887fa48fbd..243d842d06 100644 --- src/bin/scripts/vacuumdb.c +++ src/bin/scripts/vacuumdb.c @@ -947,13 +947,6 @@ init_slot(ParallelSlot *slot, PGconn *conn, const char *progname) slot->connection = conn; slot->isFree = true; slot->sock = PQsocket(conn); - -if (slot->sock < 0) -{ -fprintf(stderr, _("%s: invalid socket: %s"), progname, -PQerrorMessage(conn)); -exit(1); -} } static void 2018-04-01 2:16 GMT+09:00 Tom Lane : > CharSyam writes: >> [ simple_check.patch ] > > This is a good catch. However, it looks to me like the reason nobody > has noticed a problem here is that actually, this error test is useless; > we can never get here with a bad connection, because connectDatabase > would have failed. I'm inclined to think we should just delete the whole > if-statement, instead. > > BTW, I tried cross-checking for other errors of this ilk by doing > > diff --git a/src/include/port.h b/src/include/port.h > index a514ab758b..9ba6a0df79 100644 > --- a/src/include/port.h > +++ b/src/include/port.h > @@ -28,9 +28,9 @@ > > /* socket has a different definition on WIN32 */ > #ifndef WIN32 > -typedef int pgsocket; > +typedef unsigned int pgsocket; > > -#define PGINVALID_SOCKET (-1) > +#define PGINVALID_SOCKET ((pgsocket) -1) > #else > typedef SOCKET pgsocket; > > and then compiling with a compiler that would warn about "unsigned int < 0" > tests (I used Apple's current compiler, but probably any recent clang > would do as well). It found this case and no others, which is good. > This is not a 100% cross-check, because I don't think it would notice > "unsigned int == -1" which is another way somebody might misspell the > test, and it definitely would not catch cases where somebody stored a > socket identifier in plain int instead of pgsocket. But it's better > than nothing... > > regards, tom lane remove_if_stat.patch Description: Binary data
some last patches breaks plan cache
Hi CREATE OR REPLACE PROCEDURE public.proc(a integer, INOUT b integer, c integer) LANGUAGE plpgsql AS $procedure$ begin b := a + c; end; $procedure$ CREATE OR REPLACE PROCEDURE public.testproc() LANGUAGE plpgsql AS $procedure$ declare r int; begin call proc(10, r, 20); end; $procedure$ postgres=# call testproc(); CALL postgres=# call testproc(); ERROR: SPI_execute_plan_with_paramlist failed executing query "CALL proc(10, r, 20)": SPI_ERROR_ARGUMENT CONTEXT: PL/pgSQL function testproc() line 4 at CALL postgres=# second call fails Regards Pavel
Re: some last patches breaks plan cache
On 03/31/2018 07:38 PM, Pavel Stehule wrote: > Hi > > CREATE OR REPLACE PROCEDURE public.proc(a integer, INOUT b integer, c > integer) > LANGUAGE plpgsql > AS $procedure$ > begin > b := a + c; > end; > $procedure$ > > CREATE OR REPLACE PROCEDURE public.testproc() > LANGUAGE plpgsql > AS $procedure$ > declare r int; > begin > call proc(10, r, 20); > end; > $procedure$ > > postgres=# call testproc(); > CALL > postgres=# call testproc(); > ERROR: SPI_execute_plan_with_paramlist failed executing query "CALL > proc(10, r, 20)": SPI_ERROR_ARGUMENT > CONTEXT: PL/pgSQL function testproc() line 4 at CALL > postgres=# > > second call fails Yeah. d92bc83c48bdea9888e64cf1e2edbac9693099c9 seems to have broken this :-/ regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS][PATCH] adding simple sock check for windows
... Oh, just to make things even more fun, PQsocket() returns int, not pgsocket; see its header comment. Therefore, that test is correctly coded as-is (though it's still useless), and the real problem is that ParallelSlot.sock ought to be declared int not pgsocket. If you look around at our other places that read PQsocket() and use its result in select() masks, they uniformly use int variables, and I think they're right. Actually, the more I look at this code, the more problems I'm finding. The wait-for-a-free-connection code is just broken on its face, because it supposes it can retry select() waits without reinitializing the FD_SET. The loop in vacuum_one_database() uses "conn" to call prepare_vacuum_command() without any thought as to whether that is a free connection --- which, typically, it isn't. It looks to me like libpq silently copes with this silliness, just waiting for the active command to finish and then doing what it's told, but the net result is that supposedly-concurrent vacuums get serialized. regards, tom lane
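To make the FD_SET point concrete, a minimal sketch of the pattern being described (illustrative names only, not vacuumdb's actual code): select() overwrites the fd_set it is given, so the set has to be rebuilt before every retry.

#include <errno.h>
#include <sys/select.h>

/*
 * Wait until one of the given sockets becomes readable and return its
 * index, or -1 on error.
 */
static int
wait_for_any_socket(const int *socks, int nsocks)
{
	for (;;)
	{
		fd_set		readfds;
		int			maxfd = -1;
		int			i;

		/* select() modifies the set, so rebuild it on every iteration */
		FD_ZERO(&readfds);
		for (i = 0; i < nsocks; i++)
		{
			FD_SET(socks[i], &readfds);
			if (socks[i] > maxfd)
				maxfd = socks[i];
		}

		if (select(maxfd + 1, &readfds, NULL, NULL, NULL) < 0)
		{
			if (errno == EINTR)
				continue;		/* interrupted: rebuild and retry */
			return -1;
		}

		for (i = 0; i < nsocks; i++)
		{
			if (FD_ISSET(socks[i], &readfds))
				return i;
		}
	}
}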
Re: [HACKERS] A design for amcheck heapam verification
On Fri, Mar 30, 2018 at 7:04 PM, Andres Freund wrote: > I'm just saying that there should be two functions here, rather than dropping > the old definition, and creating a new one with a default argument. So you're asking for something like bt_index_check_heap() + bt_index_parent_check_heap()? Or, are you talking about function overloading? -- Peter Geoghegan
Re: some last patches breaks plan cache
On 03/31/2018 07:56 PM, Tomas Vondra wrote: > On 03/31/2018 07:38 PM, Pavel Stehule wrote: >> Hi >> >> CREATE OR REPLACE PROCEDURE public.proc(a integer, INOUT b integer, c >> integer) >> LANGUAGE plpgsql >> AS $procedure$ >> begin >> b := a + c; >> end; >> $procedure$ >> >> CREATE OR REPLACE PROCEDURE public.testproc() >> LANGUAGE plpgsql >> AS $procedure$ >> declare r int; >> begin >> call proc(10, r, 20); >> end; >> $procedure$ >> >> postgres=# call testproc(); >> CALL >> postgres=# call testproc(); >> ERROR: SPI_execute_plan_with_paramlist failed executing query "CALL >> proc(10, r, 20)": SPI_ERROR_ARGUMENT >> CONTEXT: PL/pgSQL function testproc() line 4 at CALL >> postgres=# >> >> second call fails > > Yeah. > > d92bc83c48bdea9888e64cf1e2edbac9693099c9 seems to have broken this :-/ > FWIW it seems the issue is somewhere in exec_stmt_call, which does this: /* * Don't save the plan if not in atomic context. Otherwise, * transaction ends would cause warnings about plan leaks. */ exec_prepare_plan(estate, expr, 0, estate->atomic); When executed outside a transaction, CALL has estate->atomic=false, and so calls exec_prepare_plan() with keepplan=false. And on the second call it gets a bogus Plan, of course (with the usual 0x7f7f7f7f7f7f7f7f patterns). When in a transaction, it sets keepplan=true, and everything works fine. So either estate->atomic is not sufficient on its own, or we need to reset the expr->plan somewhere. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] [PATCH] Incremental sort
Hi, I've been doing a bit more review of the patch today, focusing on the planner part, and I'm starting to have some doubts regarding the way incremental sort paths are created. I do have some questions about the executor and other parts too. I'll mark this as 'waiting on author' to make it clear the patch is still being discussed; RFC is not an appropriate status for that. Attached is a patch that highlights some of the interesting places, and also suggests some minor changes to comments and other tweaks. 1) planning/costing of incremental vs. non-incremental sorts In short, all the various places that create/cost sorts: * createplan.c (make_sort) * planner.c (create_sort_path) * pathnode.c (cost_sort) seem to prefer incremental sorts whenever available. Consider for example this code from create_merge_append_plan(): if (!pathkeys_common_contained_in(pathkeys, subpath->pathkeys, &n_common_pathkeys)) { Sort *sort = make_sort(subplan, numsortkeys, n_common_pathkeys, sortColIdx, sortOperators, collations, nullsFirst); label_sort_with_costsize(root, sort, best_path->limit_tuples); subplan = (Plan *) sort; } This essentially says that when (n_common_pathkeys > 0), the sort is going to be incremental. That however seems to rely on an important assumption - when the input is presorted, the incremental sort is expected to be cheaper than regular sort. This assumption however seems to be proven invalid by cost_sort, which does the common part for both sort modes (incremental/non-incremental) first, and then does this: /* Extra costs of incremental sort */ if (presorted_keys > 0) { ... add something to the costs ... } That is, the incremental cost seems to be pretty much guaranteed to be more expensive than regular Sort (with the exception of LIMIT queries, where it's guaranteed to win thanks to lower startup cost). I don't know how significant the cost difference may be (perhaps not much), or if it may lead to inefficient plans. For example what if the cheapest total path is partially sorted by chance, and only has a single prefix group? Then all the comparisons with pivotSlot are unnecessary. But I'm pretty sure it may lead to surprising behavior - for example if you disable incremental sorts (enable_incrementalsort=off), the plan will switch to plain sort without the additional costs. So you'll get a cheaper plan by disabling some operation. That's surprising. So I think it would be more appropriate if those places actually did a costing of incremental vs. non-incremental sorts, and then constructed the cheaper option. Essentially we should consider both plain and incremental sort for each partially sorted input path, and then pick the right one. Of course, this is going to be tricky in createplan.c which builds the plans directly - in that case it might be integrated into make_sort() or something like that. Also, I wonder if we could run into problems, due to incremental sort not supporting things the regular sort does (rewind, backwards scans and mark/restore). 2) nodeIncrementalsort.c There's a couple of obsolete comments that came from nodeSort.c and did not get tweaked (and so talk about first-time through when incremental sort needs to do that for each group, etc.). The attached diff tweaks those, and clarifies a couple of others. I've also added some comments explaining what the pivotSlot is about etc. There's also a couple of XXX comments asking additional questions/clarifications. I'm wondering if a static MIN_GROUP_SIZE is a good idea. 
For example, what if the subplan is expected to return only very few tuples (say, 33), but the query includes LIMIT 1? Now, let's assume the startup/total cost of the subplan is 1 and 100. With MIN_GROUP_SIZE 32 we're bound to execute it pretty much till the end, while we could terminate after the first tuple (if the prefix changes). So I think we should use a Min(limit,MIN_GROUP_SIZE) here, and perhaps this should depend on average group size too. The other questionable thing seems to be this claim: * We unlikely will have huge groups with incremental sort. Therefore * usage of abbreviated keys would be likely a waste of time. followed by disabling abbreviated keys in the tuplesort_begin_heap call. I find this rather dubious and unsupported by any arguments (I certainly don't see any in the comments). It would be more acceptable if the estimated number of groups was used when deciding whether to use incremental sort or not, but that's not the case - as explained in the first part, we simply prefer incremental sorts whenever there is a prefix. In those cases we have very little idea (or even guarantees) regarding the average group size. Furt
Re: [HACKERS] [PATCH] Incremental sort
On 03/31/2018 10:43 PM, Tomas Vondra wrote: > ... > But I'm pretty sure it may lead to surprising behavior - for example if > you disable incremental sorts (enable_incrementalsort=off), the plan > will switch to plain sort without the additional costs. So you'll get a > cheaper plan by disabling some operation. That's surprising. > To illustrate this is a valid issue, consider this trivial example: create table t (a int, b int, c int); insert into t select 10*random(), 10*random(), 10*random() from generate_series(1,100) s(i); analyze t; explain select * from (select * from t order by a,b) foo order by a,b,c; QUERY PLAN Incremental Sort (cost=133100.48..264139.27 rows=100 width=12) Sort Key: t.a, t.b, t.c Presorted Key: t.a, t.b -> Sort (cost=132154.34..134654.34 rows=100 width=12) Sort Key: t.a, t.b -> Seq Scan on t (cost=0.00..15406.00 rows=100 width=12) (6 rows) set enable_incrementalsort = off; explain select * from (select * from t order by a,b) foo order by a,b,c; QUERY PLAN Sort (cost=261402.69..263902.69 rows=100 width=12) Sort Key: t.a, t.b, t.c -> Sort (cost=132154.34..134654.34 rows=100 width=12) Sort Key: t.a, t.b -> Seq Scan on t (cost=0.00..15406.00 rows=100 width=12) (5 rows) So the cost with incremental sort was 264139, and after disabling the incremental cost it dropped to 263902. Granted, the difference is negligible in this case, but it's still surprising. Also, it can be made much more significant by reducing the number of prefix groups in the data: truncate t; insert into t select 1,1,1 from generate_series(1,100) s(i); analyze t; set enable_incrementalsort = on; explain select * from (select * from t order by a,b) foo order by a,b,c; QUERY PLAN Incremental Sort (cost=324165.83..341665.85 rows=100 width=12) Sort Key: t.a, t.b, t.c Presorted Key: t.a, t.b -> Sort (cost=132154.34..134654.34 rows=100 width=12) Sort Key: t.a, t.b -> Seq Scan on t (cost=0.00..15406.00 rows=100 width=12) (6 rows) So that's 263902 vs. 341665, yet we still prefer the incremental mode. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [PROPOSAL] Shared Ispell dictionaries
Tomas Vondra wrote: > > On 03/31/2018 12:42 PM, Arthur Zakirov wrote: > > Hello all, > > > > I'd like to add new optional function to text search template named fini > > in addition to init() and lexize(). It will be called by > > RemoveTSDictionaryById() and AlterTSDictionary(). dispell_fini() will > > call ts_dict_shmem_release(). > > > > It doesn't change segments leaking situation. I think it makes text > > search API more transparent. > > > > If it doesn't actually solve the problem, why add it? I don't see a > point in adding functions for the sake of transparency, when it does not > in fact serve any use cases. It doesn't solve the problem. But it brings more clarity: if a dictionary requested a shared location then it should release/unpin it. There is no such scenario yet, but someone might want to release not only the shared segment but also other private data. Can't we handle the segment-leaking by adding some sort of tombstone? It is interesting that there are such tombstones already, without the patch. TSDictionaryCacheEntry entries aren't deleted after DROP, they are just marked isvalid = false. > For example, imagine that instead of removing the hash table entry we > mark it as 'dropped'. And after that, after the lookup we would know the > dictionary was removed, and the backends would load the dictionary into > their private memory. > > Of course, this could mean we end up having many tombstones in the hash > table. But those tombstones would be tiny, making it less painful than > potentially leaking much more memory for the dictionaries. Now, actually, it isn't guaranteed that the hash table entry will be removed, even if refcnt is 0. So I think I should remove refcnt and entries won't be removed. There are no big problems with leaking now. Memory may leak only if a dictionary was dropped or altered and there is no text search workload anymore and the backend is still alive. This is because the next use of text search functions will unpin segments used before for invalid dictionaries (isvalid == false). Also the segment is unpinned if the backend terminates. The segment is destroyed when all interested processes unpin the segment (as Tom noticed), and the hash table entry becomes a tombstone. I hope I described it clearly. > Also, I wonder if we might actually remove the dictionaries after a > while, e.g. based on XID. Imagine that we note the XID of the > transaction removing the dictionary, or perhaps XID of the most recent > running transaction. Then we could use this to decide if all running > transactions actually see the DROP, and we could remove the tombstone. Maybe autovacuum should work here too :) It is a joke of course. I'm not very familiar with the removal of dead tuples, but I think it is a similar case here. -- Arthur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
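For concreteness, one hypothetical shape of such a shared hash table entry with a tombstone flag (struct and field names are invented here, not taken from the patch):

#include "postgres.h"
#include "storage/dsm.h"

/* Hypothetical shared hash table entry for a compiled dictionary. */
typedef struct TsearchDictEntry
{
	Oid			dict_oid;		/* hash key: the dictionary's OID */
	dsm_handle	segment;		/* segment holding the compiled dictionary */
	bool		dropped;		/* tombstone: DROP/ALTER was seen, so backends
								 * must not reuse the segment and instead load
								 * the dictionary into private memory */
} TsearchDictEntry;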
Re: [HACKERS] A design for amcheck heapam verification
On 2018-03-31 11:27:14 -0700, Peter Geoghegan wrote: > On Fri, Mar 30, 2018 at 7:04 PM, Andres Freund wrote: > > I'm just saying that there should be two functions here, rather than > > dropping the old definition, and creating a new one with a default argument. > > So you're asking for something like bt_index_check_heap() + > bt_index_parent_check_heap()? Or, are you talking about function > overloading? The latter. That addresses my concerns about dropping the function and causing issues due to dependencies. Greetings, Andres Freund
Re: [HACKERS] A design for amcheck heapam verification
On Sat, Mar 31, 2018 at 2:56 PM, Andres Freund wrote: >> So you're asking for something like bt_index_check_heap() + >> bt_index_parent_check_heap()? Or, are you talking about function >> overloading? > > The latter. That addresses my concerns about dropping the function and > causing issues due to dependencies. WFM. I have all the information I need to produce the next revision now. -- Peter Geoghegan
Re: [HACKERS] A design for amcheck heapam verification
On Sat, Mar 31, 2018 at 2:59 PM, Peter Geoghegan wrote: > WFM. I have all the information I need to produce the next revision now. I might as well post this one first. I'll have 0002 for you in a short while. -- Peter Geoghegan v8-0001-Add-Bloom-filter-data-structure-implementation.patch Description: Binary data
Re: Foreign keys and partitioned tables
Peter Eisentraut wrote: > On 3/29/18 23:19, Alvaro Herrera wrote: > > 0003 is the matter of interest. This is essentially the same code I > > posted earlier, rebased to the committed row triggers patch, with a few > > minor bug fixes and some changes in the regression tests to try and make > > them more comprehensive, including leaving a partitioned table with an > > FK to test that the whole pg_dump thing works via the pg_upgrade test. > > I've only read the tests so far. The functionality appears to work > correctly. It's a bit strange how the tests are split up between > alter_table.sql and foreign_key.sql, especially since the latter also > contains ALTER TABLE checks and vice versa. Yeah, I started by putting what I thought was going to be just ALTER TABLE in that test, then moved to the other file and added what I thought were more complete tests there and failed to move stuff to alter_table. Honestly, I think these should mostly all belong in foreign_key, but of course the line is pretty blurry as to what to put in which file. > Some tests are a bit redundant, e.g., this in alter_table.sql: > > +-- these fail: > +INSERT INTO at_partitioned VALUES (1000, 42); > +ERROR: insert or update on table "at_partitioned_0" violates foreign > key constraint "at_partitioned_reg1_col1_fkey" > +DETAIL: Key (reg1_col1)=(42) is not present in table "at_regular1". > > and > > +INSERT INTO at_partitioned VALUES (5000, 42); > +ERROR: insert or update on table "at_partitioned_0" violates foreign > key constraint "at_partitioned_reg1_col1_fkey" > +DETAIL: Key (reg1_col1)=(42) is not present in table "at_regular1". Oh, right. I had some of these to support the case of a FK pointing to a partitioned PK, but then deleted the other partitioned table that this referred to, so the test looks kinda silly without the stuff that was previously interspersed there. I think I'll remove everything from alter_table and just add what's missing to foreign_key. > There are no documentation changes. The foreign key section in CREATE > TABLE does not contain anything about partitioned tables, which is > probably an existing omission, but it might be good to fix this up now. Good catch. I propose this in the PARTITIONED BY section: - Partitioned tables do not support EXCLUDE or - FOREIGN KEY constraints; however, you can define - these constraints on individual partitions. + Partitioned tables do not support EXCLUDE constraints; + however, you can define these constraints on individual partitions. + Also, while it's possible to define PRIMARY KEY + constraints on partitioned tables, it is not supported to create foreign + keys that reference them. This restriction will be lifted in a + future release. I propose this under the REFERENCES clause: These clauses specify a foreign key constraint, which requires that a group of one or more columns of the new table must only contain values that match values in the referenced column(s) of some row of the referenced table. If the refcolumn list is omitted, the primary key of the reftable is used. The referenced columns must be the columns of a non-deferrable unique or primary key constraint in the referenced table. The user must have REFERENCES permission on the referenced table (either the whole table, or the specific referenced columns). Note that foreign key constraints cannot be defined between temporary - tables and permanent tables. 
Also note that while it is permitted to + define a foreign key on a partitioned table, declaring a foreign key + that references a partitioned table is not allowed. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
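To illustrate the behavior the proposed wording describes (table names invented; this assumes the patch under discussion is applied): a foreign key defined on a partitioned table is accepted, while a foreign key that references a partitioned table is still rejected.

CREATE TABLE at_ref (a integer PRIMARY KEY);

-- allowed with the patch: the partitioned table is the referencing side
CREATE TABLE at_part (a integer REFERENCES at_ref) PARTITION BY LIST (a);

-- still not allowed: the partitioned table is the referenced side
CREATE TABLE at_pk_part (a integer PRIMARY KEY) PARTITION BY LIST (a);
CREATE TABLE at_fk (a integer REFERENCES at_pk_part);  -- fails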
Re: some last patches breaks plan cache
On 03/31/2018 08:28 PM, Tomas Vondra wrote: > > > On 03/31/2018 07:56 PM, Tomas Vondra wrote: >> On 03/31/2018 07:38 PM, Pavel Stehule wrote: >>> Hi >>> >>> CREATE OR REPLACE PROCEDURE public.proc(a integer, INOUT b integer, c >>> integer) >>> LANGUAGE plpgsql >>> AS $procedure$ >>> begin >>> b := a + c; >>> end; >>> $procedure$ >>> >>> CREATE OR REPLACE PROCEDURE public.testproc() >>> LANGUAGE plpgsql >>> AS $procedure$ >>> declare r int; >>> begin >>> call proc(10, r, 20); >>> end; >>> $procedure$ >>> >>> postgres=# call testproc(); >>> CALL >>> postgres=# call testproc(); >>> ERROR: SPI_execute_plan_with_paramlist failed executing query "CALL >>> proc(10, r, 20)": SPI_ERROR_ARGUMENT >>> CONTEXT: PL/pgSQL function testproc() line 4 at CALL >>> postgres=# >>> >>> second call fails >> >> Yeah. >> >> d92bc83c48bdea9888e64cf1e2edbac9693099c9 seems to have broken this :-/ >> > > FWIW it seems the issue is somewhere in exec_stmt_call, which does this: > > /* > * Don't save the plan if not in atomic context. Otherwise, > * transaction ends would cause warnings about plan leaks. > */ > exec_prepare_plan(estate, expr, 0, estate->atomic); > > When executed outside transaction, CALL has estate->atomic=false, and so > calls exec_prepare_plan() with keepplan=false. And on the second call it > gets bogus Plan, of course (with the usual 0x7f7f7f7f7f7f7f7f patterns). > > When in a transaction, it sets keepplan=true, and everything works fine. > > So either estate->atomic is not sufficient on it's own, or we need to > reset the expr->plan somewhere. > The attached patch fixes this, but I'm not really sure it's the right fix - I'd expect there to be a more principled way, doing resetting the plan pointer when 'plan->saved == false'. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services diff --git a/src/pl/plpgsql/src/pl_exec.c b/src/pl/plpgsql/src/pl_exec.c index fc0f0f0..ae8bed0 100644 --- a/src/pl/plpgsql/src/pl_exec.c +++ b/src/pl/plpgsql/src/pl_exec.c @@ -2193,6 +2193,9 @@ exec_stmt_call(PLpgSQL_execstate *estate, PLpgSQL_stmt_call *stmt) exec_eval_cleanup(estate); SPI_freetuptable(SPI_tuptable); + if (!expr->plan->saved) + expr->plan = NULL; + return PLPGSQL_RC_OK; }
Re: [HACKERS] A design for amcheck heapam verification
On Sat, Mar 31, 2018 at 3:15 PM, Peter Geoghegan wrote: >> WFM. I have all the information I need to produce the next revision now. > > I might as well post this one first. I'll have 0002 for you in a short while. Attached is 0002 -- the amcheck enhancement itself. As requested by Andres, this adds a new overloaded set of functions, rather than dropping and recreating functions to change their signature. I'm pretty sure that avoiding issues with dependencies by using this approach is unprecedented, so I had to use my own judgement on how to deal with a couple of things. I decided not to create a new C symbol for the new function versions, preferring to leave it to the existing PG_NARGS() tests. I guess this was probably what you intended I should do, based on your "Given the PG_NARGS() checks..." remark. I also chose to not document the single argument functions in the docs. I suppose that we should consider these to be implementation details of a work-around for dependency breakage, something that doesn't need to be documented. That's a bit like how we don't document functions within certain extensions that are designed just to get called within a view definition. I don't feel strongly about it, though. No other changes to report. I did mention that this would have a few small changes yesterday; no need to repeat the details now. Thanks -- Peter Geoghegan From 9e2c1469013374117209c9c0c00e12ecf10ab7b9 Mon Sep 17 00:00:00 2001 From: Peter Geoghegan Date: Tue, 2 May 2017 00:19:24 -0700 Subject: [PATCH v8 2/2] Add amcheck verification of heap relations. Add a new, optional capability to bt_index_check() and bt_index_parent_check(): check that each heap tuple that should have an index entry does in fact have one. The extra checking is performed at the end of the existing nbtree checks. This is implemented by using a Bloom filter data structure. The implementation performs set membership tests within a callback (the same type of callback that each index AM registers for CREATE INDEX). The Bloom filter is populated during the initial index verification scan. Reusing the CREATE INDEX infrastructure allows the new verification option to automatically benefit from the heap consistency checks that CREATE INDEX already performs. CREATE INDEX does thorough sanity checking of HOT chains, so the new check actually manages to detect problems in heap-only tuples. 
Author: Peter Geoghegan Reviewed-By: Pavan Deolasee, Andres Freund Discussion: https://postgr.es/m/CAH2-Wzm5VmG7cu1N-H=nnS57wZThoSDQU+F5dewx3o84M+jY=g...@mail.gmail.com --- contrib/amcheck/Makefile | 2 +- contrib/amcheck/amcheck--1.0--1.1.sql| 29 +++ contrib/amcheck/amcheck.control | 2 +- contrib/amcheck/expected/check_btree.out | 12 +- contrib/amcheck/sql/check_btree.sql | 7 +- contrib/amcheck/verify_nbtree.c | 343 --- doc/src/sgml/amcheck.sgml| 126 +--- 7 files changed, 458 insertions(+), 63 deletions(-) create mode 100644 contrib/amcheck/amcheck--1.0--1.1.sql diff --git a/contrib/amcheck/Makefile b/contrib/amcheck/Makefile index 43bed91..c5764b5 100644 --- a/contrib/amcheck/Makefile +++ b/contrib/amcheck/Makefile @@ -4,7 +4,7 @@ MODULE_big = amcheck OBJS = verify_nbtree.o $(WIN32RES) EXTENSION = amcheck -DATA = amcheck--1.0.sql +DATA = amcheck--1.0--1.1.sql amcheck--1.0.sql PGFILEDESC = "amcheck - function for verifying relation integrity" REGRESS = check check_btree diff --git a/contrib/amcheck/amcheck--1.0--1.1.sql b/contrib/amcheck/amcheck--1.0--1.1.sql new file mode 100644 index 000..088416e --- /dev/null +++ b/contrib/amcheck/amcheck--1.0--1.1.sql @@ -0,0 +1,29 @@ +/* contrib/amcheck/amcheck--1.0--1.1.sql */ + +-- complain if script is sourced in psql, rather than via CREATE EXTENSION +\echo Use "ALTER EXTENSION amcheck UPDATE TO '1.1'" to load this file. \quit + +-- In order to avoid issues with dependencies when updating amcheck to 1.1, +-- create new, overloaded versions of the 1.0 functions + +-- +-- bt_index_check() +-- +CREATE FUNCTION bt_index_check(index regclass, +heapallindexed boolean) +RETURNS VOID +AS 'MODULE_PATHNAME', 'bt_index_check' +LANGUAGE C STRICT PARALLEL RESTRICTED; + +-- +-- bt_index_parent_check() +-- +CREATE FUNCTION bt_index_parent_check(index regclass, +heapallindexed boolean) +RETURNS VOID +AS 'MODULE_PATHNAME', 'bt_index_parent_check' +LANGUAGE C STRICT PARALLEL RESTRICTED; + +-- Don't want these to be available to public +REVOKE ALL ON FUNCTION bt_index_check(regclass, boolean) FROM PUBLIC; +REVOKE ALL ON FUNCTION bt_index_parent_check(regclass, boolean) FROM PUBLIC; diff --git a/contrib/amcheck/amcheck.control b/contrib/amcheck/amcheck.control index 05e2861..4690484 100644 --- a/contrib/amcheck/amcheck.control +++ b/contrib/amcheck/amcheck.control @@ -1,5 +1,5 @@ # amcheck extension comment = 'functions for verifying relation integrity' -default_version = '1.0' +default_vers
Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
On Sat, Mar 31, 2018 at 12:38:12PM -0400, Tom Lane wrote: > Craig Ringer writes: >> So we should just use the big hammer here. > > And bitch, loudly and publicly, about how broken this kernel behavior is. > If we make enough of a stink maybe it'll get fixed. That won't fix anything released already, so, as per the information gathered, something has to be done anyway. The discussion in this thread is spreading quite a lot, actually. Handling things at a low level looks like a better plan for the backend. Tools like pg_basebackup and pg_dump also issue fsyncs on the data they create, so we should do the equivalent for them, with some exit() calls in file_utils.c. As of now, failures are logged to stderr but not considered fatal. -- Michael signature.asc Description: PGP signature
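As a sketch of what making those frontend fsync failures fatal could look like (the helper name and message wording are invented, not the actual file_utils.c change):

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical frontend helper: flush a file descriptor and treat any
 * failure as fatal instead of merely printing a warning.
 */
static void
fsync_or_die(int fd, const char *fname, const char *progname)
{
	if (fsync(fd) != 0)
	{
		fprintf(stderr, "%s: could not fsync file \"%s\": %s\n",
				progname, fname, strerror(errno));
		exit(EXIT_FAILURE);
	}
}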
Re: WIP: Covering + unique indexes.
On Fri, Mar 30, 2018 at 6:24 AM, Alexander Korotkov wrote: >> With an extreme enough case, this could result in a failure to find a >> split point. Or at least, if that isn't true then it's not clear why, >> and I think it needs to be explained. > > > I don't think this could result in a failure to find a split point. > So, it finds a split point without taking into account that the hikey > will be shorter. If such a split point exists then a split point with > a truncated hikey should also exist. If not, then it would be > a failure even without covering indexes. I've updated the comment > accordingly. You're right. We're going to truncate the unneeded trailing attributes from whatever tuple is to the immediate right of the final split point that we choose (that's the tuple that we'll copy to make a new high key for the left page). Truncation already has to result in a tuple that is less than or equal to the original tuple. I also agree that it isn't worth trying harder to make sure that space is distributed evenly when truncation will go ahead. It will only matter in very rare cases, but the computational overhead of having an accurate high key size for every candidate split point would be significant. -- Peter Geoghegan
Re: [HACKERS] A design for amcheck heapam verification
On Sat, Mar 31, 2018 at 3:15 PM, Peter Geoghegan wrote: > On Sat, Mar 31, 2018 at 2:59 PM, Peter Geoghegan wrote: >> WFM. I have all the information I need to produce the next revision now. > > I might as well post this one first. I'll have 0002 for you in a short while. Looks like thrips doesn't like this, though other Windows buildfarm animals are okay with it. round() is from C99, apparently. I'll investigate a fix. -- Peter Geoghegan
Re: [HACKERS] A design for amcheck heapam verification
On 2018-03-31 19:43:45 -0700, Peter Geoghegan wrote: > On Sat, Mar 31, 2018 at 3:15 PM, Peter Geoghegan wrote: > > On Sat, Mar 31, 2018 at 2:59 PM, Peter Geoghegan wrote: > >> WFM. I have all the information I need to produce the next revision now. > > > > I might as well post this one first. I'll have 0002 for you in a short > > while. > > Looks like thrips doesn't like this, though other Windows buildfarm > animals are okay with it. > > round() is from C99, apparently. I'll investigate a fix. Just replacing it with a floor(val + 0.5) ought to do the trick, right? Greetings, Andres Freund
Re: [HACKERS] A design for amcheck heapam verification
On Sat, Mar 31, 2018 at 8:08 PM, Andres Freund wrote: >> round() is from C99, apparently. I'll investigate a fix. > > Just replacing it with a floor(val + 0.5) ought to do the trick, right? I was thinking of using rint(), which is what you get if you call round(float8) from SQL. -- Peter Geoghegan
Re: [HACKERS] A design for amcheck heapam verification
On Sat, Mar 31, 2018 at 8:09 PM, Peter Geoghegan wrote: > I was thinking of using rint(), which is what you get if you call > round(float8) from SQL. Attached patch does it that way. Note that there are float/int cast regression tests that ensure that rint() behaves consistently on supported platforms -- see commit 06bf0dd6. -- Peter Geoghegan From ff0bd32d33ceb6b5650e28d76ee794961862a4fc Mon Sep 17 00:00:00 2001 From: Peter Geoghegan Date: Sat, 31 Mar 2018 19:54:48 -0700 Subject: [PATCH] Fix non-portable call to round(). round() is from C99. Apparently, not all supported versions of Visual Studio make it available. Use rint() instead. There are behavioral differences between round() and rint(), but they should not matter to the Bloom filter optimal_k() function. We already assume POSIX behavior for rint(), so there is no question of rint() not using "rounds towards nearest" as its rounding mode. Cleanup from commit 51bc271790eb234a1ba4d14d3e6530f70de92ab5. Per buildfarm member thrips. Author: Peter Geoghegan --- src/backend/lib/bloomfilter.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/backend/lib/bloomfilter.c b/src/backend/lib/bloomfilter.c index eb08f4a..3565480 100644 --- a/src/backend/lib/bloomfilter.c +++ b/src/backend/lib/bloomfilter.c @@ -240,7 +240,7 @@ my_bloom_power(uint64 target_bitset_bits) static int optimal_k(uint64 bitset_bits, int64 total_elems) { - int k = round(log(2.0) * bitset_bits / total_elems); + int k = rint(log(2.0) * bitset_bits / total_elems); return Max(1, Min(k, MAX_HASH_FUNCS)); } -- 2.7.4
Re: [HACKERS][PATCH] adding simple sock check for windows
On Sat, Mar 31, 2018 at 11:33 PM, Tom Lane wrote: > ... Oh, just to make things even more fun, PQsocket() returns int, not > pgsocket; see its header comment. Therefore, that test is correctly > coded as-is (though it's still useless), and the real problem is that > ParallelSlot.sock ought to be declared int not pgsocket. If you look > around at our other places that read PQsocket() and use its result in > select() masks, they uniformly use int variables, and I think they're > right. > There is one other place where we are using pgsocket, see walrcv_receive, but we are using Assert in that place. I think it would be better if the output of PQsocket() can be consistently used everywhere. If there is no particular restriction, then it will be good to make it as 'int' everywhere. > Actually, the more I look at this code, the more problems I'm finding. > The wait-for-a-free-connection code is just broken on its face, because > it supposes it can retry select() waits without reinitializing the > FD_SET. The loop in vacuum_one_database() uses "conn" to call > prepare_vacuum_command() without any thought as to whether that is a > free connection --- which, typically, it isn't. It looks to me like > libpq silently copes with this silliness, just waiting for the active > command to finish and then doing what it's told, but the net result > is that supposedly-concurrent vacuums get serialized. > I think it would have been better if this code would have found the free_slot before preparing the command and then used the connection from free slot. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] A design for amcheck heapam verification
On 2018-03-31 20:25:24 -0700, Peter Geoghegan wrote: > On Sat, Mar 31, 2018 at 8:09 PM, Peter Geoghegan wrote: > > I was thinking of using rint(), which is what you get if you call > > round(float8) from SQL. > > Attached patch does it that way. Note that there are float/int cast > regression tests that ensure that rint() behaves consistently on > supported platforms -- see commit 06bf0dd6. LGTM, pushed. Closing CF entry. Yay! Only 110 to go. - Andres
Re: [HACKERS] A design for amcheck heapam verification
On Sat, Mar 31, 2018 at 8:32 PM, Andres Freund wrote: > LGTM, pushed. Closing CF entry. Yay! Only 110 to go. Thanks Andres! -- Peter Geoghegan
Re: [HACKERS][PATCH] adding simple sock check for windows
On Sun, Apr 1, 2018 at 8:59 AM, Amit Kapila wrote: > On Sat, Mar 31, 2018 at 11:33 PM, Tom Lane wrote: >> ... Oh, just to make things even more fun, PQsocket() returns int, not >> pgsocket; see its header comment. Therefore, that test is correctly >> coded as-is (though it's still useless), and the real problem is that >> ParallelSlot.sock ought to be declared int not pgsocket. If you look >> around at our other places that read PQsocket() and use its result in >> select() masks, they uniformly use int variables, and I think they're >> right. >> > > There is one other place where we are using pgsocket, see > walrcv_receive, but we are using Assert in that place. I think it > would be better if the output of PQsocket() can be consistently used > everywhere. If there is no particular restriction, then it will be > good to make it as 'int' everywhere. > > >> Actually, the more I look at this code, the more problems I'm finding. >> The wait-for-a-free-connection code is just broken on its face, because >> it supposes it can retry select() waits without reinitializing the >> FD_SET. The loop in vacuum_one_database() uses "conn" to call >> prepare_vacuum_command() without any thought as to whether that is a >> free connection --- which, typically, it isn't. It looks to me like >> libpq silently copes with this silliness, just waiting for the active >> command to finish and then doing what it's told, but the net result >> is that supposedly-concurrent vacuums get serialized. >> > > I think it would have been better if this code would have found the > free_slot before preparing the command and then used the connection > from free slot. > oops, I just saw that you have already pushed a fix for it. I am not sure if we should try to do anything about walrcv_receive's output still uses pgsocket instead of int as the usage in itself doesn't have any problem. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] why not parallel seq scan for slow functions
On Fri, Mar 30, 2018 at 1:41 AM, Robert Haas wrote: > On Thu, Mar 29, 2018 at 12:55 AM, Amit Kapila wrote: >> I think that is under acceptable range. I am seeing few regression >> failures with the patch series. The order of targetlist seems to have >> changed for Remote SQL. Kindly find the failure report attached. I >> have requested my colleague Ashutosh Sharma to cross-verify this and >> he is also seeing the same failures. > > Oops. Those just require an expected output change. > >> It seems UPPERREL_TLIST is redundant in the patch now. I think we can >> remove it unless you have something else in mind. > > Yes. > >> I think the handling of partitioned rels looks okay, but we might want >> to once check the overhead of the same unless you are sure that this >> shouldn't be a problem. If you think, we should check it once, then >> is it possible that we can do it as a separate patch as this doesn't >> look to be directly linked to the main patch. It can be treated as an >> optimization for partitionwise aggregates. I think we can treat it >> along with the main patch as well, but it might be somewhat simpler to >> verify it if we do it separately. > > I don't think it should be a problem, although you're welcome to test > it if you're concerned about it. I think it would probably be > penny-wise and pound-foolish to worry about the overhead of > eliminating the Result nodes, which can occur not only with > partition-wise aggregate but also with partition-wise join and, I > think, really any case where the top scan/join plan would be an Append > node. We're talking about a very small amount of additional planning > time to potentially get a better plan. > > I've committed all of these now. > Cool, I have closed the corresponding CF entry. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
lazy detoasting
Hi, If I copy an out-of-line, on-disk TOAST pointer into a memory context with transaction lifetime, with an eye to detoasting it later in the same transaction, are there circumstances where it wouldn't work? Thanks, -Chap
Re: lazy detoasting
Chapman Flack writes: > If I copy an out-of-line, on-disk TOAST pointer into a memory context > with transaction lifetime, with an eye to detoasting it later in the > same transaction, are there circumstances where it wouldn't work? Should be safe *as long as you hold onto a snapshot that can see the toast value's parent row*. Otherwise, it's not. For a counterexample, see the changes we had to make to avoid depending on out-of-line toast values in the catcaches, commit 08e261cbc. regards, tom lane
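For illustration, a rough backend-code sketch of the pattern Tom describes: copy the datum into a transaction-lifetime context and keep a registered snapshot around until the value has been detoasted. This is an assumption-laden example (it assumes the active snapshot at copy time is one that can see the parent row), not code from any patch.

#include "postgres.h"
#include "access/tuptoaster.h"
#include "utils/datum.h"
#include "utils/memutils.h"
#include "utils/snapmgr.h"

static Datum	saved_value;		/* may still be an out-of-line TOAST pointer */
static Snapshot	saved_snapshot;		/* keeps the parent row's toast data reachable */

static void
save_for_later(Datum toastable_datum)
{
	MemoryContext oldcxt = MemoryContextSwitchTo(TopTransactionContext);

	/* varlena datum: pass-by-reference, variable length */
	saved_value = datumCopy(toastable_datum, false, -1);
	/* hold a snapshot that can still see the toast value's parent row */
	saved_snapshot = RegisterSnapshot(GetActiveSnapshot());
	MemoryContextSwitchTo(oldcxt);
}

static struct varlena *
detoast_later(void)
{
	/* still in the same transaction, snapshot still registered */
	struct varlena *result =
		heap_tuple_untoast_attr((struct varlena *) DatumGetPointer(saved_value));

	UnregisterSnapshot(saved_snapshot);
	return result;
}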
Re: [HACKERS][PATCH] adding simple sock check for windows
Amit Kapila writes: > oops, I just saw that you have already pushed a fix for it. I am not > sure if we should try to do anything about walrcv_receive's output > still uses pgsocket instead of int as the usage in itself doesn't have > any problem. I see a few places where we're still assigning PQsocket's result to a pgsocket variable, but none of them are worrying about the possibility that it's not a valid socket, and I'm not sure they need to. regards, tom lane
Re: [HACKERS] Removing [Merge]Append nodes which contain a single subpath
On 16 March 2018 at 02:46, Robert Haas wrote: > If we stick with your idea of using AppendPath, do we actually need > generate_proxy_paths()? What paths get lost if we don't have a > special case for them here? Actually, we do a surprisingly good job of allowing plan shapes to stay the same when planning for an only-child scan for an Append or MergeAppend. The main difference I did find was that Append and MergeAppend don't support Mark and Restore, so if you just generated the same paths and simply skipped over Appends and MergeAppends you'd still be left with Materialize nodes which might not actually be required at all. This might not be huge, but seeing this made me worried that there might be some valid reason, if not today, then sometime in the future, why it might not be safe to simply pluck the singleton Append/MergeAppend nodes out of the plan tree. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: csv format for psql
Hello Isaac, Personally I do not have any problem with CSV defaulting to '|' separator, given that anyway people often use anything but a comma for the purpose, including '|'. However Pavel wants to block the patch on this point. Too bad. OK, mostly trying to avoid commenting because I doubt I have much to add. But. If I ask for CSV and don't specify any overrides, I expect to get "C"omma separated values, not some other character. More specifically, if I say --csv I expect to get files that are identical with what I would get if I used COPY ... CSV. My summary was incomplete. The --csv option implementation by Daniel already does that. The issue Pavel is complaining about is that in interactive mode "\pset format csv" does not do the same: it triggers the csv-rule string-escaping mechanism, but does not reset the "fieldsep" variable (eh, it sets the "format" variable) so the default separator under this interactive use is "|" if the "fieldsep" variable is shared. I have suggested a "\csv" interactive command to set both as a convenient shorthand for "\pset format csv & \pset fieldsep ','", similarly to --csv, but Pavel still wants "\pset format csv" to trigger the full csv output. A consequence of adding "fieldsep_csv" which I forgot is that it probably has to duplicate the "_zero" ugly hack to be consistent with existing "*sep" variables, or else be inconsistent. Sigh. -- Fabien.
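As a concrete example, under the patch being discussed the two-step interactive form mentioned above would look like this (the proposed "\csv" shorthand, if accepted, would replace the two \pset calls):

\pset format csv
\pset fieldsep ','
SELECT 1 AS a, 'comma, inside' AS b;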