Re: [HACKERS] Tweaking Foreign Keys for larger tables
Peter Eisentraut wrote on 10/31/14 6:19 AM:
> Simon Riggs wrote:
>> Various ways of tweaking Foreign Keys are suggested that are helpful for larger databases.
>>
>> *INITIALLY NOT ENFORCED
>> FK created, but is not enforced during DML. Will be/Must be marked NOT VALID when first created. We can run a VALIDATE on the constraint at any time; if it passes the check it is marked VALID and presumed to stay that way until the next VALIDATE run.
>
> Does that mean the FK would become invalid after every DML operation, until you explicitly revalidate it? Is that practical?

My read is that it means you can insert invalid data, but the system will pretend it is valid unless someone asks it for confirmation. Upon validation the FK will become invalid until the discrepancy is fixed and another validation is performed.

>> ON DELETE IGNORE
>> ON UPDATE IGNORE
>> If we allow this specification then the FK is one way - we check the existence of a row in the referenced table, but there is no need for a trigger on the referenced table to enforce an action on delete or update, so no need to lock the referenced table when adding FKs.
>
> Are you worried about locking the table at all, or about having to lock many rows?

Wouldn't you at least need some kind of trigger to make the constraint invalid as soon as any record is updated or removed from the referenced table, since in all likelihood the FK relationship has just been broken?

How expensive is validation going to be? In particular, can validation occur incrementally, or does every record need to be validated each time? Is this useful for master-detail setups, record-category, or both (others?)? Will optimizations over invalid data give incorrect answers, and in what specific scenarios can that be expected?

I get the idea of having a system that lets you skip constant data validation, since in all likelihood once in production some scenarios would be extremely resistant to the introduction of errors and can be dealt with on the fly. Trust only, since the verify is expensive - but keep the option open and the model faithfully represented. I don't know that I would ever think to use this in my world, since the additional admin effort is obvious but the cost of the thing I'd be avoiding is vague.

As it is now, someone could simply drop their FK constraints and run a validation query periodically to see if the data being inserted is correct. That doesn't allow for optimizations to take place, though, so this is an improvement; but the documentation and support aspects for a keep/drop decision can be fleshed out first, as that would be valuable in its own right. Then go about figuring out how to make a hybrid implementation work.

Put another way: at what point does the cost of the FK constraint outweigh the optimization savings? While size is obvious, both schema and read/write patterns likely have a significant influence.

David J.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
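The "drop the FK and validate periodically" workaround David describes can be written as an ordinary anti-join; the table and column names here are invented for illustration:

```sql
-- Hypothetical schema: orders.customer_id would normally reference customers(id).
-- With the FK dropped, a periodic check finds any orphaned rows:
SELECT o.id, o.customer_id
FROM   orders o
LEFT   JOIN customers c ON c.id = o.customer_id
WHERE  o.customer_id IS NOT NULL
  AND  c.id IS NULL;
-- An empty result means the (absent) constraint currently holds.
```

As David notes, this catches violations after the fact but gives the planner no constraint to optimize against.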
Re: [HACKERS] Tweaking Foreign Keys for larger tables
On 5 November 2014 21:15, Peter Eisentraut pete...@gmx.net wrote:
>> ON DELETE IGNORE
>> ON UPDATE IGNORE
>> If we allow this specification then the FK is one way - we check the existence of a row in the referenced table, but there is no need for a trigger on the referenced table to enforce an action on delete or update, so no need to lock the referenced table when adding FKs.
>
> Are you worried about locking the table at all, or about having to lock many rows?

This is useful for smaller, highly referenced tables that don't change much, if ever. In that case the need for correctness through locking is minimal. If we do lock, it will cause very high multixact traffic, so that is worth avoiding on its own.

The main issue is referencing a table many times. Getting a full table lock can halt all FK checks, so skipping adding the trigger altogether avoids freezing everything up just for a trigger that doesn't actually do much.

--
Simon Riggs   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Tweaking Foreign Keys for larger tables
On 5 November 2014 21:15, Peter Eisentraut pete...@gmx.net wrote:
> On 10/31/14 6:19 AM, Simon Riggs wrote:
>> Various ways of tweaking Foreign Keys are suggested that are helpful for larger databases.
>>
>> *INITIALLY NOT ENFORCED
>> FK created, but is not enforced during DML. Will be/Must be marked NOT VALID when first created. We can run a VALIDATE on the constraint at any time; if it passes the check it is marked VALID and presumed to stay that way until the next VALIDATE run.
>
> Does that mean the FK would become invalid after every DML operation, until you explicitly revalidate it? Is that practical?

I think so. We store the validity on the relcache entry. The constraint would add a statement-level after trigger for insert, update, delete and truncate, which issues a relcache invalidation if the state was marked valid. Marked as deferrable, initially deferred.

--
Simon Riggs   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
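For context, the NOT VALID state referenced in the proposal already exists for constraint creation; unlike the proposed INITIALLY NOT ENFORCED, new DML is still checked immediately - only pre-existing rows are exempt until validation. The table and constraint names here are hypothetical:

```sql
-- Existing mechanism: add the FK without scanning existing rows ...
ALTER TABLE orders
  ADD CONSTRAINT orders_customer_fk
  FOREIGN KEY (customer_id) REFERENCES customers (id) NOT VALID;

-- ... then scan and mark it valid later, under a weaker lock:
ALTER TABLE orders VALIDATE CONSTRAINT orders_customer_fk;
```

The proposal would extend this so that even new DML skips the check, with VALIDATE as the only correctness guarantee.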
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
On Tue, Nov 4, 2014 at 2:03 PM, Rahila Syed rahilasye...@gmail.com wrote:
> Hello,
> Please find updated patch with the review comments given above implemented.

Hunk #3 FAILED at 692.
1 out of 3 hunks FAILED -- saving rejects to file src/backend/access/transam/xlogreader.c.rej

The patch does not apply cleanly to master. Could you update the patch?

Regards,

--
Fujii Masao
Re: [HACKERS] CREATE IF NOT EXISTS INDEX
On Sat, Nov 1, 2014 at 1:56 AM, Fabrízio de Royes Mello fabriziome...@gmail.com wrote:
> On Fri, Oct 31, 2014 at 2:46 PM, Adam Brightwell adam.brightw...@crunchydatasolutions.com wrote:
>> All,
>> FWIW, I've cleanly applied v8 of this patch to master (252e652) and check-world was successful. I also successfully ran through a few manual test cases.
>
> Thanks for your review!

Applied. Thanks!

Regards,

--
Fujii Masao
Re: [HACKERS] Sequence Access Method WIP
On 11/05/2014 05:01 AM, Petr Jelinek wrote:
> I guess I could port BDR sequences to this if it would help (once we have a bit more solid agreement that the proposed API at least theoretically seems ok, so that I don't have to rewrite it 10 times, if at all possible).

Because the BDR sequences rely on all the other BDR machinery, I suspect that'd be a pretty big thing to review and follow for someone who doesn't know the BDR code.

Do you think it'd be simple to provide a blocking, transactional sequence allocator via this API? i.e. gapless sequences, much the same as typically implemented with SELECT ... FOR UPDATE on a counter table. It might be more digestible standalone, and would be a handy contrib/ example extension demonstrating use of the API.

--
Craig Ringer   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
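The counter-table idiom Craig refers to is typically written along these lines (the table and column names are illustrative, not from any patch):

```sql
-- One row per gapless sequence; the row lock serializes allocators.
CREATE TABLE invoice_counter (id int PRIMARY KEY, last_value bigint NOT NULL);
INSERT INTO invoice_counter VALUES (1, 0);

BEGIN;
-- Takes a row lock held until commit; concurrent callers block here.
UPDATE invoice_counter
   SET last_value = last_value + 1
 WHERE id = 1
RETURNING last_value;
-- Use the returned value, then COMMIT.
-- A ROLLBACK leaves no gap, unlike nextval() on a regular sequence.
COMMIT;
```

The blocking is exactly what makes the allocation gapless, and exactly what a sequence AM implementation of it would need to reproduce.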
[HACKERS] Typo in comment
I ran into a typo in a comment in src/backend/commands/matview.c. Please find attached a patch.

Best regards,
Etsuro Fujita

diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 30bd40d..db05f7c 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -473,7 +473,7 @@ transientrel_destroy(DestReceiver *self)
  * the given integer, to make a new table name based on the old one.
  *
  * This leaks memory through palloc(), which won't be cleaned up until the
- * current memory memory context is freed.
+ * current memory context is freed.
  */
 static char *
 make_temptable_name_n(char *tempname, int n)
Re: [HACKERS] two dimensional statistics in Postgres
On 06/11/14 23:15, Katharina Büchse wrote:
> Hi,
> I'm a PhD student at the University of Jena, Thüringen, Germany, in the field of databases, more precisely query optimization. I want to implement a system in PostgreSQL that detects column correlations and creates statistical data about correlated columns for the optimizer. Therefore I need to store two-dimensional statistics (especially two-dimensional histograms) in PostgreSQL.
>
> I had a look at the description of "WIP: multivariate statistics / proof of concept", which looks really promising. I guess these statistics are based on scans of the data? For my system I need both -- statistical data based on table scans (actually, samples are enough) and those based on query feedback. Query feedback (tuple counts and, speaking a little inaccurately, the where-part of the query itself) needs to be extracted, and there needs to be a decision for the optimizer when to take multivariate statistics and when to use the one-dimensional ones. Oracle in this case just disables one-dimensional histograms if there is already a multidimensional histogram, but this is not always useful, especially in the case of a feedback-based histogram (which might not cover the whole data space).
>
> I want to use both kinds of histograms because correlations might occur only in parts of the data. In this case a histogram based on a sample of the whole table might not get the point and wouldn't help for the part of the data the user seems to be interested in. There are special data structures for storing multidimensional histograms based on feedback, and I already tried to implement one of these in C. In the case of two dimensions they are of course not for free (one-dimensional would be much cheaper), but based on the principle of maximum entropy they deliver really good results. I decided for only two dimensions because in this case we have the best proportion of cost and benefit when searching for correlation (here I'm relying on tests that were made in DB2 within a project called CORDS, which detects correlations even between different tables).
>
> I'd be grateful for any advice and discussion.
>
> Regards,
> Katharina

Could you store a 2 dimensional histogram in a one dimensional array: A[z] = value, where z = col * rowSize + row (zero starting index)?

Cheers,
Gavin
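Gavin's indexing scheme, sketched against a PostgreSQL array (arrays are 1-based, so the zero-based z needs a +1; the bucket counts are invented for illustration):

```sql
-- A 2x3 histogram (2 columns of buckets x 3 rows) stored flat,
-- addressed as hist[(col * rowSize + row) + 1] with rowSize = 3.
SELECT (ARRAY[10, 20, 30, 40, 50, 60])[(1 * 3 + 2) + 1];  -- col=1, row=2
```

This works as long as every cell holds the same type; Tomas's follow-up question about mixed column types is exactly where the simple flattening stops being sufficient.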
Re: [HACKERS] B-Tree index builds, CLUSTER, and sortsupport
On 11/06/2014 02:30 AM, Peter Geoghegan wrote:
> Thanks for the review.
>
> On Wed, Nov 5, 2014 at 4:33 PM, Andreas Karlsson andr...@proxel.se wrote:
>> I looked at the changes to the code. The new code is clean and there is more code re-use and improved readability. One possible further improvement would be to move the preparation of SortSupport to a common function, since this is done three times in the code.
>
> The idea there is to have more direct control of sortsupport. With the abbreviated keys patch, abbreviation occurs based on a decision made by tuplesort.c. I can see why you'd say that, but I prefer to keep initialization of sortsupport structs largely concentrated in tuplesort.c, and more or less uniform regardless of the tuple type being sorted.

Ok, I am fine with either.

>> Is there any case where we should expect any greater performance improvement?
>
> The really compelling case is abbreviated keys - as you probably know, there is a patch that builds on this patch, and the abbreviated keys patch, so that B-Tree builds and CLUSTER can use abbreviated keys too. That doesn't really have much to do with this patch, though. The important point is that heap tuple sorting (with a query that has no client overhead, and involves one big sort) and B-Tree index creation both have their tuplesort as a totally dominant cost. The improvements for each case should be quite comparable, which is good (except, as noted in my opening e-mail, when heap/datum tuplesorting can use the onlyKey optimization, while B-Tree/CLUSTER tuplesorting cannot).

Ah, I see.

--
Andreas Karlsson
Re: [HACKERS] two dimensional statistics in Postgres
Hi,

On 6 November 2014, 11:15, Katharina Büchse wrote:
> Hi,
> I'm a PhD student at the University of Jena, Thüringen, Germany, in the field of databases, more precisely query optimization. I want to implement a system in PostgreSQL that detects column correlations and creates statistical data about correlated columns for the optimizer. Therefore I need to store two-dimensional statistics (especially two-dimensional histograms) in PostgreSQL.

Cool!

> I had a look at the description of "WIP: multivariate statistics / proof of concept", which looks really promising. I guess these statistics are based on scans of the data?

Yes, it's based on a sample of the data.

> For my system I need both -- statistical data based on table scans (actually, samples are enough) and those based on query feedback. Query feedback (tuple counts and, speaking a little inaccurately, the where-part of the query itself) needs to be extracted, and there needs to be a decision for the optimizer when to take multivariate statistics and when to use the one-dimensional ones. Oracle in this case just disables one-dimensional histograms if there is already a multidimensional histogram, but this is not always useful, especially in the case of a feedback-based histogram (which might not cover the whole data space). I want to use both kinds of histograms

What do you mean by not covering the whole data space? I assume that when building a feedback-based histogram, parts of the data will be filtered out because of WHERE clauses etc. Is that what you mean? I don't see how this could happen for regular histograms, though.

> because correlations might occur only in parts of the data. In this case a histogram based on a sample of the whole table might not get the point and wouldn't help for the part of the data the user seems to be interested in.

Yeah, there may be dependencies that are difficult to spot in the whole dataset, but emerge once you filter to a specific subset. Now, how would that work in practice? Initially the query needs to be planned using regular stats (because there's no feedback yet), and then - when we decide the estimates are way off - may be re-planned using the feedback. The feedback is inherently query-specific, so I'm not sure if it's possible to reuse it for multiple queries (might be possible for sufficiently similar ones). Would this be done automatically for all queries / all conditions, or only when specifically enabled (on a table, columns, ...)?

> There are special data structures for storing multidimensional histograms based on feedback and I already tried to implement one of these in C. In the case of two dimensions they are of course not for free (one-dimensional would be much cheaper), but based on the principle of maximum entropy they deliver really good results. I decided for only two dimensions because in this case we have the best proportion of cost and benefit when searching for correlation

I think hardcoding the two-dimensions limit is wrong. I understand a higher number of dimensions means a more expensive operation, but if the user can influence it, I believe it's OK. Also, is there any particular reason not to support other kinds of stats (say, MCV lists)? In the end it's just a different way to approximate the distribution, and it may be way cheaper than histograms.

> (here I'm relying on tests that were made in DB2 within a project called CORDS which detects correlations even between different tables).

Is this somehow related to LEO? I'm not familiar with the details, but from the description it might be related.

regards
Tomas
Re: [HACKERS] two dimensional statistics in Postgres
On 6 November 2014, 11:50, Gavin Flower wrote:
> Could you store a 2 dimensional histogram in a one dimensional array: A[z] = value, where z = col * rowSize + row (zero starting index)?

How would that work for columns with different data types?

Tomas
Re: [HACKERS] Proposal : REINDEX SCHEMA
On Fri, Oct 24, 2014 at 3:41 AM, Fabrízio de Royes Mello fabriziome...@gmail.com wrote:
> On Wed, Oct 22, 2014 at 9:21 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote:
>> Sawada Masahiko wrote:
>>> Thank you for reviewing. I agree with 2) - 5). The attached patch is the latest version, modified as above. Also, I noticed I had forgotten to add the patch regarding the documentation of reindexdb.
>>
>> Please don't use pg_catalog in the regression test. That way we will need to update the expected file whenever a new catalog is added, which seems pointless. Maybe create a schema with a couple of tables specifically for this, instead.
>
> Attached new regression test.

Hunk #1 FAILED at 1.
1 out of 2 hunks FAILED -- saving rejects to file src/bin/scripts/t/090_reindexdb.pl.rej

I tried to apply the 001 patch after applying the 000, but it did not apply cleanly.

At least to me, it's more intuitive to use SCHEMA instead of ALL IN SCHEMA here, because we already use DATABASE instead of ALL IN DATABASE.

Regards,

--
Fujii Masao
Re: [HACKERS] Typo in comment
On Thu, Nov 6, 2014 at 7:44 PM, Etsuro Fujita fujita.ets...@lab.ntt.co.jp wrote:
> I ran into a typo in a comment in src/backend/commands/matview.c. Please find attached a patch.
>
> Best regards,
> Etsuro Fujita

Thanks! Applied.

--
Fujii Masao
Re: [HACKERS] two dimensional statistics in Postgres
On 06/11/14 23:57, Tomas Vondra wrote:
> On 6 November 2014, 11:50, Gavin Flower wrote:
>> Could you store a 2 dimensional histogram in a one dimensional array: A[z] = value, where z = col * rowSize + row (zero starting index)?
>
> How would that work for columns with different data types?
>
> Tomas

I implicitly assumed that all cells were of the same type (and hence the same size). However, I could devise a scheme to cope with columns of different types, given the relevant definitions - but this would obviously be more complicated. Also, this method can be extended into higher dimensions.

Cheers,
Gavin
Re: [HACKERS] Order of views in stats docs
On Wed, Nov 5, 2014 at 4:57 PM, Tom Lane t...@sss.pgh.pa.us wrote:
> Magnus Hagander mag...@hagander.net writes:
>> http://www.postgresql.org/docs/9.3/static/monitoring-stats.html, table 27-1. Can somebody find or explain the order of the views in there? It's not actually alphabetical, but it's also not logical. In particular, what is pg_stat_replication doing second to last? I would suggest we move pg_stat_replication up to directly under pg_stat_activity, and move pg_stat_database_conflicts up to directly under pg_stat_database. I think the rest makes reasonable sense. Any objections to this? Can anybody spot a reason for why they are where they are, other than that it was just appended to the end of the table without realizing the order that I'm missing now and am about to break?
>
> I agree that the last two items seem to be suffering from blindly-add-it-to-the-end syndrome, which is a disease that runs rampant around here. However, should we consider the possibility of changing the table to straight alphabetical ordering? I'm not as much in love with that approach as some folks, but it does have the merit that it's always clear where you ought to put a new item. This would result in grouping the "all", "sys", and "user" views separately, rather than grouping those variants of a view together ... but on reflection I'm not sure that that'd be totally horrible.

That would at least make it very predictable, yes.

Another thought I had in that case is maybe we need to break out the pg_stat_activity and pg_stat_replication views into their own table. They are really the only two views that are different in a lot of ways. Maybe call the splits "session statistics views" and "object statistics views", instead of just "standard statistics views"? The difference being that they show snapshots of *right now*, whereas all the other views show accumulated statistics over time.
--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
Re: [HACKERS] two dimensional statistics in Postgres
On 6 November 2014, 12:05, Gavin Flower wrote:
> On 06/11/14 23:57, Tomas Vondra wrote:
>> How would that work for columns with different data types?
>
> I implicitly assumed that all cells were the same size type. However, I could devise a scheme to cope with columns of different types, given the relevant definitions - but this would obviously be more complicated. Also this method can be extended into higher dimensions.

Which is what I did in the "WIP: multivariate stats" ;-)

Tomas
Re: [HACKERS] Repeatable read and serializable transactions see data committed after tx start
On 06/11/14 02:06, Jim Nasby wrote:
> On 11/5/14, 6:04 PM, Álvaro Hernández Tortosa wrote:
>> On 05/11/14 17:46, Jim Nasby wrote:
>>> On 11/4/14, 6:11 PM, Álvaro Hernández Tortosa wrote:
>>>> Should we then improve the docs, stating this more clearly? Any objection to doing this?
>>>
>>> If we go that route we should also mention that now() will no longer be doing what you probably hope it would (AFAIK it's driven by BEGIN and not the first snapshot).
>>
>> If I understand you correctly, you mean that if we add a note to the documentation stating that the transaction really freezes when you do the first query, people would expect now() to be also frozen when the first query is done, which is not what happens (it's frozen at tx start). Then, yes, you're right, probably *both* the isolation levels and the now() function documentation should be patched to become more precise.
>
> Bingo. Hrm, is there anything else that differs between the two? Perhaps we should change how now() works, but I'm worried about what that might do to existing applications...
>
>> Perhaps. I also believe it might not be good for existing applications, but it definitely has a different freeze behavior, which seems inconsistent too.
>
> Yeah, I'd really rather fix it...

There have been two comments which seem to state that changing this may introduce some performance problems and some limitations when you need to take out some locks. I still believe, however, that the current behavior is confusing for the user. Sure, one option is to patch the documentation, as I was suggesting. But what about creating a flag to the BEGIN and SET TRANSACTION commands, called IMMEDIATE FREEZE (or something similar), which applies only to REPEATABLE READ and SERIALIZABLE? If this flag is set (it may be off by default, but of course the default may be configurable via a GUC parameter), the freeze happens at BEGIN or SET TRANSACTION time.
This would be a backwards-compatible change, which would provide the option of freezing without the nasty hack of having to do a "SELECT 1" prior to your real queries, and everything would of course be well documented. What do you think?

Best regards,
Álvaro

--
Álvaro Hernández Tortosa
8Kdata
Re: [HACKERS] WAL format and API changes (9.5)
On 11/05/2014 11:30 AM, Andres Freund wrote:
> On 2014-11-04 18:33:34 +0200, Heikki Linnakangas wrote:
>> On 10/30/2014 06:02 PM, Andres Freund wrote:
>>> On 2014-10-29 10:24:20 +0200, Heikki Linnakangas wrote:
>>>> On 10/06/2014 02:29 PM, Andres Freund wrote:
>>>>> I've not yet really looked, but on a quick readthrough XLogInsertRecData() staying in xlog.c doesn't make me happy...
>>>>
>>>> Ok.. Can you elaborate?
>>>
>>> To me the split between xloginsert.c doing some of the record assembly, and xlog.c doing the lower level part of the assembly is just wrong.
>>
>> I moved the checks for bootstrap mode and xl_len == 0 from the current XLogInsert to the new XLogInsert() function. I don't imagine that to be enough to address your feelings about the split between xlog.c and xloginsert.c, but it makes it a tiny bit clearer IMHO. I don't know what else to do about it, as it feels very sensible to me as it is now. So unless you object more loudly, I'm going to just go ahead with this and commit, probably tomorrow.
>
> I'm still not particularly happy with the split. But I think this patch is enough of a win anyhow. So go ahead.

Committed. Now, to the main patch...

- Heikki
Re: [HACKERS] initdb -S and tablespaces
At 2014-10-30 14:30:27 +0530, a...@2ndquadrant.com wrote:
> Here's a proposed patch to initdb to make initdb -S fsync everything under pg_tblspc.

Oops, I meant to include the corresponding patch to xlog.c to do the same at startup. It's based on the initdb patch, but modified to not use fprintf/exit_nicely and so on. (Note that this was written in a single chunk to aid backpatching. There's no attempt made to share code in this set of patches.) Now attached.

-- Abhijit

From ae91da4df7ee60e6f83c98801e979929442f588a Mon Sep 17 00:00:00 2001
From: Abhijit Menon-Sen a...@2ndquadrant.com
Date: Thu, 6 Nov 2014 00:45:56 +0530
Subject: If we need to perform crash recovery, fsync PGDATA recursively

This is so that we don't lose older unflushed writes in the event of a
power failure after crash recovery, where more recent writes are
preserved. See 20140918083148.ga17...@alap3.anarazel.de for details.
---
 src/backend/access/transam/xlog.c | 184 ++
 1 file changed, 184 insertions(+)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ab04380..ef95f64 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -830,6 +830,12 @@ static void WALInsertLockAcquireExclusive(void);
 static void WALInsertLockRelease(void);
 static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);

+static void pre_sync_fname(char *fname, bool isdir);
+static void walkdir(char *path, void (*action) (char *fname, bool isdir));
+static void walktblspc_links(char *path,
+				 void (*action) (char *fname, bool isdir));
+static void perform_fsync(char *pg_data);
+
 /*
  * Insert an XLOG record having the specified RMID and info bytes,
  * with the body of the record being the data chunk(s) described by
@@ -5981,6 +5987,19 @@ StartupXLOG(void)
 		ereport(FATAL,
 				(errmsg("control file contains invalid data")));

+	/*
+	 * If we need to perform crash recovery, we issue an fsync on the
+	 * data directory and its contents to try to ensure that any data
+	 * written before the crash are flushed to disk. Otherwise a power
+	 * failure in the near future might cause earlier unflushed writes
+	 * to be lost, even though more recent data written to disk from
+	 * here on would be persisted.
+	 */
+
+	if (ControlFile->state != DB_SHUTDOWNED &&
+		ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY)
+		perform_fsync(data_directory);
+
 	if (ControlFile->state == DB_SHUTDOWNED)
 	{
 		/* This is the expected case, so don't be chatty in standalone mode */
@@ -11262,3 +11281,168 @@ SetWalWriterSleeping(bool sleeping)
 	XLogCtl->WalWriterSleeping = sleeping;
 	SpinLockRelease(&XLogCtl->info_lck);
 }
+
+/*
+ * Hint to the OS that it should get ready to fsync() this file.
+ *
+ * Adapted from pre_sync_fname in initdb.c
+ */
+static void
+pre_sync_fname(char *fname, bool isdir)
+{
+#if defined(HAVE_SYNC_FILE_RANGE) || \
+	(defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED))
+	int			fd;
+
+	fd = open(fname, O_RDONLY | PG_BINARY);
+
+	/*
+	 * Some OSs don't allow us to open directories at all (Windows returns
+	 * EACCES)
+	 */
+	if (fd < 0 && isdir && (errno == EISDIR || errno == EACCES))
+		return;
+
+	if (fd < 0)
+		ereport(FATAL,
+				(errmsg("could not open file \"%s\" before fsync",
+						fname)));
+
+	/*
+	 * Prefer sync_file_range, else use posix_fadvise. We ignore any error
+	 * here since this operation is only a hint anyway.
+	 */
+#if defined(HAVE_SYNC_FILE_RANGE)
+	sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
+#elif defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
+	posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
+#endif
+
+	close(fd);
+#endif
+}
+
+/*
+ * walkdir: recursively walk a directory, applying the action to each
+ * regular file and directory (including the named directory itself).
+ *
+ * Adapted from copydir() in copydir.c.
+ */
+static void
+walkdir(char *path, void (*action) (char *fname, bool isdir))
+{
+	DIR		   *dir;
+	struct dirent *de;
+
+	dir = AllocateDir(path);
+	while ((de = ReadDir(dir, path)) != NULL)
+	{
+		char		subpath[MAXPGPATH];
+		struct stat fst;
+
+		CHECK_FOR_INTERRUPTS();
+
+		if (strcmp(de->d_name, ".") == 0 ||
+			strcmp(de->d_name, "..") == 0)
+			continue;
+
+		snprintf(subpath, MAXPGPATH, "%s/%s", path, de->d_name);
+
+		if (lstat(subpath, &fst) < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not stat file \"%s\": %m", subpath)));
+
+		if (S_ISDIR(fst.st_mode))
+			walkdir(subpath, action);
+		else if (S_ISREG(fst.st_mode))
+			(*action) (subpath, false);
+	}
+	FreeDir(dir);
+
+	/*
+	 * It's important to fsync the destination directory itself as individual
+	 * file fsyncs don't guarantee that the directory entry for the file is
+	 * synced. Recent versions of ext4 have made the window much wider but
+	 * it's been an issue for ext3 and other filesystems in the past.
+	 */
+	(*action) (path, true);
+}
+
+/*
+ * walktblspc_links: call walkdir on each entry under the given
+ *
Re: [HACKERS] Repeatable read and serializable transactions see data committed after tx start
Álvaro Hernández Tortosa a...@8kdata.com wrote:
> There have been two comments which seem to state that changing this may introduce some performance problems and some limitations when you need to take out some locks. I still believe, however, that the current behavior is confusing for the user. Sure, one option is to patch the documentation, as I was suggesting.

Yeah, I thought that's what we were talking about, and in that regard I agree that the docs could be more clear. I'm not quite sure what to say where to fix that, but I can see how someone could be confused and have the expectation that once they have run BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE the transaction will not see the work of transactions committing after that.

The fact that this is possible is implied, if one reads carefully and thinks about it, by the statement right near the start of the Transaction Isolation section which says "any concurrent execution of a set of Serializable transactions is guaranteed to produce the same effect as running them one at a time in some order". As Robert pointed out, this is not necessarily the commit order or the transaction start order. It is entirely possible that if you have serializable transactions T1 and T2, where T1 executes BEGIN first (and even runs a query before T2 executes BEGIN) and T1 commits first, that T2 will appear to have run first because it will look at a set of data which T1 modifies and not see the changes. If T1 were to *also* look at a set of data which T2 modifies, then one of the transactions would be rolled back with a serialization failure, to prevent a cycle in the apparent order of execution; so the requirements of the standard (and of most software which is attempting to handle race conditions) are satisfied.
For many popular benchmarks (and I suspect most common workloads) this provides the necessary protections with better performance than is possible using blocking to provide the required guarantees.[1] At any rate, the language in that section is a little fuzzy on the concept of the start of the transaction. Perhaps it would be enough to change language like: | sees a snapshot as of the start of the transaction, not as of the | start of the current query within the transaction. to: | sees a snapshot as of the start of the first query within the | transaction, not as of the start of the current query within the | transaction. Would that have prevented the confusion here? But what about creating a flag to BEGIN and SET TRANSACTION commands, called IMMEDIATE FREEZE (or something similar), which applies only to REPEATABLE READ and SERIALIZABLE? If this flag is set (and may be off by default, but of course the default may be configurable via a guc parameter), freeze happens when it is present (BEGIN or SET TRANSACTION) time. This would be a backwards-compatible change, while it would provide the option of freezing without the nasty hack of having to do a SELECT 1 prior to your real queries, and everything will of course be well documented. What is the use case where you are having a problem? This seems like an odd solution, so it would be helpful to know what problem it is attempting to solve. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company [1] Dan R. K. Ports and Kevin Grittner. Serializable Snapshot Isolation in PostgreSQL. In VLDB, pages 1850--1861, 2012. http://vldb.org/pvldb/vol5/p1850_danrkports_vldb2012.pdf (see section 8 for performance graphs and numbers) -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
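The timing Kevin describes (snapshot acquired at the first query, not at BEGIN) can be demonstrated in two psql sessions. The following is an illustrative sketch; the table t and its contents are hypothetical:

```sql
-- Session 1                                 -- Session 2
BEGIN ISOLATION LEVEL REPEATABLE READ;
-- no snapshot has been acquired yet
                                             INSERT INTO t VALUES (1);
                                             -- (autocommits)
SELECT count(*) FROM t;
-- the snapshot is taken here, at the first query,
-- so the row inserted by session 2 IS visible
COMMIT;
```

Running the SELECT before session 2's INSERT, or issuing any other query immediately after BEGIN, pins the snapshot earlier and the new row would not be seen.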
Re: [HACKERS] Order of views in stats docs
On 11/6/14 6:16 AM, Magnus Hagander wrote: Another thought I had in that case is maybe we need to break out the pg_stat_activity and pg_stat_replication views into their own table. They are really the only two views that are different in a lot of ways. Maybe call the splits session statistics views and object statistics views, instead of just standard statistics views? The difference being that they show snapshots of *right now*, whereas all the other views show accumulated statistics over time. Yeah, splitting this up a bit would help, too.
Re: [HACKERS] Repeatable read and serializable transactions see data committed after tx start
Kevin Grittner wrote: (wording change suggestion) | sees a snapshot as of the start of the first query within the | transaction, not as of the start of the current query within the | transaction. Would that have prevented the confusion here? I think it may have, but I also think the wording should be much stronger and clearer, as this is unintuitive behavior. Consider this snippet from Bruce's excellent MVCC Unmasked presentation: A snapshot is recorded at the start of each SQL statement in READ COMMITTED transaction isolation mode, and at transaction start in SERIALIZABLE transaction isolation mode. This is both correct and incorrect, depending on whether you consider a transaction to start with BEGIN; or with the first statement after the BEGIN. :) I think most people have always assumed that BEGIN starts the transaction and that is the point at which the snapshot is obtained. But what about creating a flag to BEGIN and SET TRANSACTION commands, called IMMEDIATE FREEZE (or something similar), which applies only to REPEATABLE READ and SERIALIZABLE? If this flag is set (and may be off by default, but of course the default may be configurable via a guc parameter), freeze happens when it is present (BEGIN or SET TRANSACTION) time. This would be a backwards-compatible change, while it would provide the option of freezing without the nasty hack of having to do a SELECT 1 prior to your real queries, and everything will of course be well documented. What is the use case where you are having a problem? This seems like an odd solution, so it would be helpful to know what problem it is attempting to solve. Seems like a decent solution to me. The problem is that having to execute a dummy SQL statement to start a serializable transaction, rather than simply a BEGIN, is ugly and error-prone. Perhaps their app assumes (or even requires) that BEGIN starts the snapshot. 
-- Greg Sabino Mullane g...@turnstep.com End Point Corporation http://www.endpoint.com/ PGP Key: 0x14964AC8 201411060922 http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8
Re: PENDING_LIST_CLEANUP_SIZE - maximum size of GIN pending list Re: [HACKERS] HEAD seems to generate larger WAL regarding GIN index
On Tue, Nov 4, 2014 at 12:04 PM, Etsuro Fujita fujita.ets...@lab.ntt.co.jp wrote: IIUC, I think that min = 0 disables fast update, so ISTM that it'd be appropriate to set min to some positive value. And ISTM that the idea of using the min value of work_mem is not so bad. OK. I changed the min value to 64kB. *** 356,361 CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ replaceable class=parametername/ --- 356,372 /listitem /varlistentry /variablelist +variablelist +varlistentry + termliteralPENDING_LIST_CLEANUP_SIZE//term The above is still in uppercase. Fixed. Attached is the updated version of the patch. Thanks for the review! Regards, -- Fujii Masao *** a/doc/src/sgml/config.sgml --- b/doc/src/sgml/config.sgml *** *** 5911,5916 SET XML OPTION { DOCUMENT | CONTENT }; --- 5911,5937 /listitem /varlistentry + varlistentry id=guc-pending-list-cleanup-size xreflabel=pending_list_cleanup_size + termvarnamepending_list_cleanup_size/varname (typeinteger/type) + indexterm +primaryvarnamepending_list_cleanup_size/ configuration parameter/primary + /indexterm + /term + listitem +para + Sets the maximum size of the GIN pending list which is used + when literalfastupdate/ is enabled. If the list grows + larger than this maximum size, it is cleaned up by moving + the entries in it to the main GIN data structure in bulk. + The default is four megabytes (literal4MB/). This setting + can be overridden for individual GIN indexes by changing + storage parameters. + See xref linkend=gin-fast-update and xref linkend=gin-tips + for more information. +/para + /listitem + /varlistentry + /variablelist /sect2 sect2 id=runtime-config-client-format *** a/doc/src/sgml/gin.sgml --- b/doc/src/sgml/gin.sgml *** *** 728,735 from the indexed item). As of productnamePostgreSQL/productname 8.4, acronymGIN/ is capable of postponing much of this work by inserting new tuples into a temporary, unsorted list of pending entries. 
!When the table is vacuumed, or if the pending list becomes too large !(larger than xref linkend=guc-work-mem), the entries are moved to the main acronymGIN/acronym data structure using the same bulk insert techniques used during initial index creation. This greatly improves acronymGIN/acronym index update speed, even counting the additional --- 728,735 from the indexed item). As of productnamePostgreSQL/productname 8.4, acronymGIN/ is capable of postponing much of this work by inserting new tuples into a temporary, unsorted list of pending entries. !When the table is vacuumed, or if the pending list becomes larger than !xref linkend=guc-pending-list-cleanup-size, the entries are moved to the main acronymGIN/acronym data structure using the same bulk insert techniques used during initial index creation. This greatly improves acronymGIN/acronym index update speed, even counting the additional *** *** 750,756 para If consistent response time is more important than update speed, use of pending entries can be disabled by turning off the !literalFASTUPDATE/literal storage parameter for a acronymGIN/acronym index. See xref linkend=sql-createindex for details. /para --- 750,756 para If consistent response time is more important than update speed, use of pending entries can be disabled by turning off the !literalfastupdate/literal storage parameter for a acronymGIN/acronym index. See xref linkend=sql-createindex for details. /para *** *** 812,829 /varlistentry varlistentry !termxref linkend=guc-work-mem/term listitem para During a series of insertions into an existing acronymGIN/acronym ! index that has literalFASTUPDATE/ enabled, the system will clean up the pending-entry list whenever the list grows larger than ! varnamework_mem/. To avoid fluctuations in observed response time, ! it's desirable to have pending-list cleanup occur in the background ! (i.e., via autovacuum). Foreground cleanup operations can be avoided by ! 
increasing varnamework_mem/ or making autovacuum more aggressive. ! However, enlarging varnamework_mem/ means that if a foreground ! cleanup does occur, it will take even longer. /para /listitem /varlistentry --- 812,837 /varlistentry varlistentry !termxref linkend=guc-pending-list-cleanup-size/term listitem para During a series of insertions into an existing acronymGIN/acronym ! index that has literalfastupdate/ enabled, the system will clean up the pending-entry list whenever the list grows larger than !
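Assuming the patch behaves as described above, the new knob would be settable both as a GUC and as a per-index storage parameter. A hypothetical usage sketch (the index and table names are invented):

```sql
-- server-wide default for GIN pending-list cleanup:
SET pending_list_cleanup_size = '4MB';

-- per-index override via the proposed storage parameter (value in kB):
CREATE INDEX docs_body_gin ON docs USING gin (to_tsvector('english', body))
    WITH (fastupdate = on, pending_list_cleanup_size = 1024);
```

The per-index setting decouples pending-list cleanup from work_mem, which is the point of the patch: work_mem can stay large for sorts and hashes without inflating GIN cleanup pauses.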
Re: [HACKERS] Repeatable read and serializable transactions see data committed after tx start
Greg Sabino Mullane g...@turnstep.com wrote: Kevin Grittner wrote: (wording change suggestion) | sees a snapshot as of the start of the first query within the | transaction, not as of the start of the current query within the | transaction. Would that have prevented the confusion here? I think it may have, but I also think the wording should be much stronger and clearer, as this is unintuitive behavior. Consider this snippet from Bruce's excellent MVCC Unmasked presentation: A snapshot is recorded at the start of each SQL statement in READ COMMITTED transaction isolation mode, and at transaction start in SERIALIZABLE transaction isolation mode. This is both correct and incorrect, depending on whether you consider a transaction to start with BEGIN; or with the first statement after the BEGIN. :) I think most people have always assumed that BEGIN starts the transaction and that is the point at which the snapshot is obtained. But there is so much evidence to the contrary. Not only does the *name* of the command (BEGIN or START) imply a start, but pg_stat_activity shows the connection idle in transaction after the command (and before a snapshot is acquired), the Explicit Locking section of the docs asserts that Once acquired, a lock is normally held till end of transaction, and the docs for the SET command assert that The effects of SET LOCAL last only till the end of the current transaction, whether committed or not. The end of *which* transaction? The one that started with BEGIN or START and which might not (or in some cases *must* not) yet have a snapshot. But what about creating a flag to BEGIN and SET TRANSACTION commands, called IMMEDIATE FREEZE (or something similar), which applies only to REPEATABLE READ and SERIALIZABLE? If this flag is set (and may be off by default, but of course the default may be configurable via a guc parameter), freeze happens when it is present (BEGIN or SET TRANSACTION) time. 
This would be a backwards-compatible change, while it would provide the option of freezing without the nasty hack of having to do a SELECT 1 prior to your real queries, and everything will of course be well documented. What is the use case where you are having a problem? This seems like an odd solution, so it would be helpful to know what problem it is attempting to solve. Seems like a decent solution to me. The problem is that having to execute a dummy SQL statement to start a serializable transaction, rather than simply a BEGIN, is ugly and error-prone. Perhaps their app assumes (or even requires) that BEGIN starts the snapshot. Why? This fix might not deal with the bigger issues that I discussed, like that the later-to-start and later-to-acquire-a-snapshot transaction might logically be first in the apparent order of execution. You can't fix that without a lot of blocking -- that most of us don't want. Depending on *why* they think this is important, they might need to be acquiring various locks to prevent behavior they don't want, in which case having acquired a snapshot at BEGIN would be exactly the *wrong* thing to do. The exact nature of the problem we're trying to solve here does matter. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] REINDEX CONCURRENTLY 2.0
On 10/1/14 3:00 AM, Michael Paquier wrote: - Use of AccessExclusiveLock when swapping relfilenodes of an index and its concurrent entry instead of ShareUpdateExclusiveLock for safety. At the limit of my understanding, that's the consensus reached until now. I'm very curious about this point. I looked through all the previous discussions, and the only time I saw this mentioned was at the very beginning when it was said that we could review the patch while ignoring this issue and fix it later with MVCC catalog access. Then it got very technical, but it was never explicitly concluded whether it was possible to fix this or not. Also, in the thread Concurrently option for reindexdb you pointed out that requiring an exclusive lock isn't really concurrent and proposed an option like --minimum-locks. I will point out again that we specifically invented DROP INDEX CONCURRENTLY because holding an exclusive lock even briefly isn't good enough. If REINDEX cannot work without an exclusive lock, we should invent some other qualifier, like WITH FEWER LOCKS. It's still useful, but we shouldn't give people the idea that they can throw away their custom CREATE INDEX CONCURRENTLY + DROP INDEX CONCURRENTLY scripts.
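The custom scripts referred to above typically amount to something like the following (index and table names hypothetical); the only exclusive lock involved is the brief one taken for the rename:

```sql
CREATE INDEX CONCURRENTLY t_col_idx_new ON t (col);  -- no long-held exclusive lock
DROP INDEX CONCURRENTLY t_col_idx;                   -- waits out readers instead of blocking them
ALTER INDEX t_col_idx_new RENAME TO t_col_idx;       -- brief AccessExclusiveLock here only
```

A built-in REINDEX that needs an AccessExclusiveLock for the relfilenode swap would behave like the last step, which is what the "WITH FEWER LOCKS, not CONCURRENTLY" naming concern is about.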
Re: [HACKERS] Repeatable read and serializable transactions see data committed after tx start
Kevin Grittner kgri...@ymail.com writes: Why? This fix might not deal with the bigger issues that I discussed, like that the later-to-start and later-to-acquire-a-snapshot transaction might logically be first in the apparent order of execution. You can't fix that without a lot of blocking -- that most of us don't want. Depending on *why* they think this is important, they might need to be acquiring various locks to prevent behavior they don't want, in which case having acquired a snapshot at BEGIN would be exactly the *wrong* thing to do. The exact nature of the problem we're trying to solve here does matter. Another thing that I think hasn't been mentioned in this thread is that we used to have severe problems with client libraries that like to issue BEGIN and then go idle until they have something to do. Which, for some reason, is a prevalent behavior. That used to result in problems like VACUUM not being able to clean up dead rows promptly. We fixed that some time ago by making sure we didn't acquire an XID until the first actual statement after BEGIN. Snapshots as such were never a problem for this, because we've never acquired a snapshot immediately at BEGIN ... but if we did so, this problem would come right back. regards, tom lane
Re: [HACKERS] Amazon Redshift
--On 5. November 2014 23:36:03 +0100 philip taylor philta...@mail.com wrote: Date Functions http://docs.aws.amazon.com/redshift/latest/dg/r_ADD_MONTHS.html http://docs.aws.amazon.com/redshift/latest/dg/r_DATEADD_function.html http://docs.aws.amazon.com/redshift/latest/dg/r_DATEDIFF_function.html http://docs.aws.amazon.com/redshift/latest/dg/r_LAST_DAY.html http://docs.aws.amazon.com/redshift/latest/dg/r_MONTHS_BETWEEN_function.html http://docs.aws.amazon.com/redshift/latest/dg/r_NEXT_DAY.html FYI, except DATEDIFF() all of these functions can be installed by using the orafce extension, I believe. -- Thanks Bernd
Re: [HACKERS] WAL format and API changes (9.5)
Replying to some of your comments below. The rest were trivial issues that I'll just fix. On 10/30/2014 09:19 PM, Andres Freund wrote: * Is it really a good idea to separate XLogRegisterBufData() from XLogRegisterBuffer() the way you've done it? If we ever actually get a record with a large number of blocks touched this essentially is O(touched_buffers*data_entries). Are you worried that the linear search in XLogRegisterBufData(), to find the right registered_buffer struct, might be expensive? If that ever becomes a problem, a simple fix would be to start the linear search from the end, and make sure that when you touch a large number of blocks, you do all the XLogRegisterBufData() calls right after the corresponding XLogRegisterBuffer() call. I've also thought about having XLogRegisterBuffer() return the pointer to the struct, and passing it as argument to XLogRegisterBufData. So the pattern in WAL generating code would be like: registered_buffer *buf0; buf0 = XLogRegisterBuffer(0, REGBUF_STANDARD); XLogRegisterBufData(buf0, data, length); registered_buffer would be opaque to the callers. That would have potential to turn XLogRegisterBufData into a macro or inline function, to eliminate the function call overhead. I played with that a little bit, but the difference in performance was so small that it didn't seem worth it. But passing the registered_buffer pointer, like above, might not be a bad thing anyway. * There's lots of functions like XLogRecHasBlockRef() that dig through the wal record. A common pattern is something like: if (XLogRecHasBlockRef(record, 1)) XLogRecGetBlockTag(record, 1, NULL, NULL, oldblk); else oldblk = newblk; I think doing that repeatedly is quite a bad idea. We should parse the record once and then use it in a sensible format. Not do it in pieces, over and over again. It's not like we ignore backup blocks - so doing this lazily doesn't make sense to me. Especially as ValidXLogRecord() *already* has parsed the whole damn thing once. Hmm. 
Adding some kind of a parsed XLogRecord representation would need a fair amount of new infrastructure. The vast majority of WAL records contain one, maybe two, block references, so it's not that expensive to find the right one, even if you do it several times. I ran a quick performance test on WAL replay performance yesterday. I ran pgbench for 100 transactions with WAL archiving enabled, and measured the time it took to replay the generated WAL. I did that with and without the patch, and I didn't see any big difference in replay times. I also ran perf on the startup process, and the profiles looked identical. I'll do more comprehensive testing later, with different index types, but I'm convinced that there is no performance issue in replay that we'd need to worry about. If it mattered, a simple optimization to the above pattern would be to have XLogRecGetBlockTag return true/false, indicating if the block reference existed at all. So you'd do: if (!XLogRecGetBlockTag(record, 1, NULL, NULL, oldblk)) oldblk = newblk; On the other hand, decomposing the WAL record into parts, and passing the decomposed representation to the redo routines would allow us to pack the WAL record format more tightly, as accessing the different parts of the on-disk format wouldn't then need to be particularly fast. For example, I've been thinking that it would be nice to get rid of the alignment padding in XLogRecord, and between the per-buffer data portions. We could copy the data to aligned addresses as part of the decomposition or parsing of the WAL record, so that the redo routines could still assume aligned access. * I wonder if it wouldn't be worthwhile, for the benefit of the FPI compression patch, to keep the bkp block data after *all* the headers. That'd make it easier to just compress the data. Maybe. If we do that, I'd also be inclined to move all the bkp block headers to the beginning of the WAL record, just after the XLogInsert struct. 
Somehow it feels weird to have a bunch of header structs sandwiched between the rmgr-data and per-buffer data. Also, 4-byte alignment is enough for the XLogRecordBlockData struct, so that would be an easy way to make use of the space currently wasted for alignment padding in XLogRecord. * I think heap_xlog_update is buggy for wal_level=logical because it computes the length of the tuple using tuplen = recdataend - recdata; But the old primary key/old tuple value might be stored there as well. Afaics that code has to continue to use xl_heap_header_len. No, the old primary key or tuple is stored in the main data portion. That tuplen computation is done on backup block 0's data. * It looks to me like insert wal logging could just use REGBUF_KEEP_DATA to get rid of: + /* +* The new tuple is normally stored as buffer 0's data. But if +* XLOG_HEAP_CONTAINS_NEW_TUPLE
Re: [HACKERS] group locking: incomplete patch, just for discussion
On Wed, Nov 5, 2014 at 9:26 PM, Robert Haas robertmh...@gmail.com wrote: Yes, I think that's probably a net improvement in robustness quite apart from what we decide to do about any of the rest of this. I've attached it here as revise-procglobal-tracking.patch and will commit that bit if nobody objects. The remainder is reattached without change as group-locking-v0.1.patch. Per your other comment, I've developed the beginnings of a testing framework which I attached here as test_group_locking-v0.patch. Urk. I attached a version with some stupid bugs. Here's a slightly better one. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company

diff --git a/contrib/test_group_locking/Makefile b/contrib/test_group_locking/Makefile
new file mode 100644
index 000..2d09341
--- /dev/null
+++ b/contrib/test_group_locking/Makefile
@@ -0,0 +1,21 @@
+# contrib/test_group_locking/Makefile
+
+MODULE_big = test_group_locking
+OBJS = test_group_locking.o $(WIN32RES)
+PGFILEDESC = "test_group_locking - test harness for group locking"
+
+EXTENSION = test_group_locking
+DATA = test_group_locking--1.0.sql
+
+REGRESS = test_group_locking
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/test_group_locking
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/test_group_locking/test_group_locking--1.0.sql b/contrib/test_group_locking/test_group_locking--1.0.sql
new file mode 100644
index 000..adb2be5
--- /dev/null
+++ b/contrib/test_group_locking/test_group_locking--1.0.sql
@@ -0,0 +1,8 @@
+/* contrib/test_group_locking/test_group_locking--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_group_locking" to load this file. \quit
+
+CREATE FUNCTION test_group_locking(spec pg_catalog.text)
+RETURNS pg_catalog.void STRICT
+	AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/contrib/test_group_locking/test_group_locking.c b/contrib/test_group_locking/test_group_locking.c
new file mode 100644
index 000..904beff
--- /dev/null
+++ b/contrib/test_group_locking/test_group_locking.c
@@ -0,0 +1,1071 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_group_locking.c
+ *		Test harness code for group locking.
+ *
+ * Copyright (C) 2013, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		contrib/test_shm_mq/test.c
+ *
+ * -------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/xact.h"
+#include "catalog/namespace.h"
+#include "commands/dbcommands.h"
+#include "fmgr.h"
+#include "lib/ilist.h"
+#include "lib/stringinfo.h"
+#include "libpq/libpq.h"
+#include "libpq/pqformat.h"
+#include "libpq/pqmq.h"
+#include "mb/pg_wchar.h"
+#include "miscadmin.h"
+#include "nodes/makefuncs.h"
+#include "parser/scansup.h"
+#include "storage/ipc.h"
+#include "storage/lmgr.h"
+#include "storage/procarray.h"
+#include "storage/shm_mq.h"
+#include "storage/shm_toc.h"
+#include "utils/builtins.h"
+#include "utils/hsearch.h"
+#include "utils/memutils.h"
+#include "utils/resowner.h"
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_group_locking);
+
+void		test_group_locking_worker_main(Datum main_arg);
+
+/* Names of lock modes, for debug printouts */
+static const char *const lock_mode_names[] =
+{
+	"INVALID",
+	"AccessShareLock",
+	"RowShareLock",
+	"RowExclusiveLock",
+	"ShareUpdateExclusiveLock",
+	"ShareLock",
+	"ShareRowExclusiveLock",
+	"ExclusiveLock",
+	"AccessExclusiveLock"
+};
+
+typedef enum
+{
+	TGL_START,
+	TGL_STOP,
+	TGL_LOCK,
+	TGL_UNLOCK
+} TestGroupLockOp;
+
+typedef struct
+{
+	TestGroupLockOp op;
+	LOCKMODE	lockmode;
+	Oid			relid;
+} TestGroupLockCommand;
+
+typedef struct
+{
+	dlist_node	node;
+	bool		verified;
+	int			group_id;
+	int			task_id;
+	TestGroupLockCommand command;
+} TestGroupLockStep;
+
+typedef struct
+{
+	int			group_id;
+	int			task_id;
+} worker_key;
+
+typedef struct
+{
+	worker_key	key;
+	dsm_segment *seg;
+	BackgroundWorkerHandle *handle;
+	shm_mq_handle *requesth;
+	shm_mq_handle *responseh;
+	bool		awaiting_response;
+} worker_info;
+
+typedef struct
+{
+	int			group_id;
+	int			leader_task_id;
+	bool		has_followers;
+} leader_info;
+
+/* Fixed-size data passed via our dynamic shared memory segment. */
+typedef struct worker_fixed_data
+{
+	Oid			database_id;
+	Oid			authenticated_user_id;
+	NameData	database;
+	NameData	authenticated_user;
+	bool		use_group_locking;
+	pid_t		leader_pid;
+} worker_fixed_data;
+
+#define SHM_QUEUE_SIZE		32768
+#define TEST_GROUP_LOCKING_MAGIC	0x4c616e65
+
+static void check_for_messages(HTAB *worker_hash);
+static void determine_leader_info(dlist_head *plan, HTAB *leader_hash);
+static void handle_sigterm(SIGNAL_ARGS);
+static void process_message(HTAB *worker_hash, worker_info *info,
+				char *message, Size message_bytes);
+static void rationalize_steps(dlist_head
Re: [HACKERS] split builtins.h to quote.h
On 14/10/2014 10:00, Michael Paquier wrote: On Mon, Oct 13, 2014 at 11:14 PM, Robert Haas robertmh...@gmail.com wrote: IMHO, putting some prototypes for a .c file in one header and others in another header is going to make it significantly harder to figure out which files you need to #include when. Keeping a simple rule there seems essential to me. OK. Let's make the separation clear then. The attached does so, and moves the remaining quote_* functions previously in builtins.h to quote.h. I am adding it to the upcoming CF for tracking the effort on it. Hi Michael, I just reviewed this patch: * applies cleanly to master (d2b8a2c7) * all regression tests pass As it's only moving functions from builtins.h to quote.h and updating impacted files, nothing special to add. It will probably break some user extensions using quote_* functions. Otherwise it looks ready to commit. Regards. -- Julien Rouhaud http://dalibo.com - http://dalibo.org
Re: [HACKERS] Tweaking Foreign Keys for larger tables
On 11/6/14, 2:58 AM, Simon Riggs wrote: On 5 November 2014 21:15, Peter Eisentraut pete...@gmx.net wrote: On 10/31/14 6:19 AM, Simon Riggs wrote: Various ways of tweaking Foreign Keys are suggested that are helpful for larger databases. *INITIALLY NOT ENFORCED FK created, but is not enforced during DML. Will be/Must be marked NOT VALID when first created. We can run a VALIDATE on the constraint at any time; if it passes the check it is marked VALID and presumed to stay that way until the next VALIDATE run. Does that mean the FK would become invalid after every DML operation, until you explicitly revalidate it? Is that practical? I think so. We store the validity on the relcache entry. Constraint would add a statement-level after trigger for insert, update, delete and truncate, which issues a relcache invalidation if the state was marked valid. Marked as deferrable initially deferred. I don't think you'd need to invalidate on insert, or on an update that didn't touch a referenced key. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
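For comparison, the existing NOT VALID machinery, which the INITIALLY NOT ENFORCED proposal would build on, already lets the initial check be deferred; a sketch with hypothetical table names:

```sql
-- add the FK without scanning the existing rows:
ALTER TABLE detail
    ADD CONSTRAINT detail_master_fk
    FOREIGN KEY (master_id) REFERENCES master (id) NOT VALID;

-- later, scan once and mark the constraint valid:
ALTER TABLE detail VALIDATE CONSTRAINT detail_master_fk;
```

Today the constraint is still enforced for new DML even while NOT VALID; the proposal goes further by skipping enforcement during DML as well, which is why revalidation after changes becomes the open question.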
Re: [HACKERS] split builtins.h to quote.h
Julien Rouhaud wrote: I just reviewed this patch: * applies cleanly to master (d2b8a2c7) * all regression tests pass As it's only moving functions from builtins.h to quote.h and updating impacted files, nothing special to add. It will probably break some user extensions using quote_* functions. Otherwise it looks ready to commit. I thought the consensus was that the SQL-callable function declarations should remain in builtins.h -- mainly so that quote.h does not need to include fmgr.h. -- Álvaro Herrera  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Tweaking Foreign Keys for larger tables
On Thu, Nov 6, 2014 at 10:29 AM, Jim Nasby wrote: On 11/6/14, 2:58 AM, Simon Riggs wrote: On 5 November 2014 21:15, Peter Eisentraut wrote: On 10/31/14 6:19 AM, Simon Riggs wrote: Various ways of tweaking Foreign Keys are suggested that are helpful for larger databases. *INITIALLY NOT ENFORCED FK created, but is not enforced during DML. Will be/Must be marked NOT VALID when first created. We can run a VALIDATE on the constraint at any time; if it passes the check it is marked VALID and presumed to stay that way until the next VALIDATE run. Does that mean the FK would become invalid after every DML operation, until you explicitly revalidate it? Is that practical? I think so. We store the validity on the relcache entry. Constraint would add a statement-level after trigger for insert, update, delete and truncate, which issues a relcache invalidation if the state was marked valid. Marked as deferrable initially deferred. I don't think you'd need to invalidate on insert, Why? Since the FK is not enforced there is no guarantee that what you just inserted is valid or on an update that didn't touch a referenced key. OK - but you would still need the trigger on the FK columns DELETE is OK as well since you cannot invalidate the constraint by simply removing the referencing row. David J.
Re: [HACKERS] What exactly is our CRC algorithm?
On 11/04/2014 03:21 PM, Robert Haas wrote: On Tue, Nov 4, 2014 at 4:47 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: I hear none, so committed with some small fixes. Are you going to get the slice-by-N stuff working next, to speed this up? I don't have any concrete plans, but yeah, that would be great. So definitely maybe. - Heikki
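For readers unfamiliar with the slice-by-N technique Robert mentions, here is a rough slice-by-4 sketch of a reflected CRC-32 (zlib's polynomial, not necessarily the one PostgreSQL settles on), checked against zlib. Instead of one table lookup per input byte, it folds four bytes into the CRC per step using four precomputed tables:

```python
import zlib

def make_tables(poly=0xEDB88320, n=4):
    # tables[0] is the classic byte-at-a-time table for a reflected CRC-32
    t0 = []
    for b in range(256):
        c = b
        for _ in range(8):
            c = (c >> 1) ^ poly if c & 1 else c >> 1
        t0.append(c)
    tables = [t0]
    # tables[k][b] extends tables[k-1][b] by one extra zero byte
    for k in range(1, n):
        prev = tables[k - 1]
        tables.append([(prev[b] >> 8) ^ t0[prev[b] & 0xFF] for b in range(256)])
    return tables

T = make_tables()

def crc32_slice4(data):
    crc = 0xFFFFFFFF
    i, n = 0, len(data)
    while n - i >= 4:
        # fold four input bytes (little-endian word) into the CRC in one step
        crc ^= data[i] | data[i+1] << 8 | data[i+2] << 16 | data[i+3] << 24
        crc = (T[3][crc & 0xFF] ^ T[2][(crc >> 8) & 0xFF]
               ^ T[1][(crc >> 16) & 0xFF] ^ T[0][(crc >> 24) & 0xFF])
        i += 4
    for b in data[i:]:                      # byte-at-a-time tail
        crc = (crc >> 8) ^ T[0][(crc ^ b) & 0xFF]
    return crc ^ 0xFFFFFFFF

assert crc32_slice4(b"123456789") == zlib.crc32(b"123456789")
```

In Python this is of course slower than the one-table loop, but in C the same structure turns four dependent per-byte steps into one step with four independent loads, which is where the speedup comes from.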
[HACKERS] json, jsonb, and casts
In 9.3 we changed the way json generating functions worked by taking account of cast functions to json from non-builtin types, such as hstore. In 9.5 I am proposing to provide similar functionality for jsonb. The patch actually takes account of cast functions to both jsonb and json (with jsonb preferred). If there is a cast to jsonb, we use it and then merge the result into the jsonb being accumulated. If there is just a cast to json, we use it, and then parse that directly into the result datum. It was arguably a bit of an oversight not to take account of casts to jsonb in 9.4 in datum_to_json(). So I'm thinking of rolling into this patch changes to json.c::datum_to_json() and friends to take analogous account of casts to jsonb (i.e. call the cast function, turn the resulting jsonb into a cstring and append it to the result). Thoughts? cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
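The preference order Andrew proposes (use a cast to jsonb if one exists, otherwise fall back to a cast to json and parse its output, otherwise render the value the default way) might look like this toy resolver. The registry and type names are invented, standing in for the pg_cast lookups the patch would do:

```python
import json

def datum_to_jsonb(value, casts):
    """casts: {(source_type_name, target_name): callable}; a hypothetical
    cast registry, not the real PostgreSQL cast machinery."""
    src = type(value).__name__
    if (src, "jsonb") in casts:
        # a cast to jsonb is preferred: merge its result directly
        return casts[(src, "jsonb")](value)
    if (src, "json") in casts:
        # only a cast to json: produce json text, then parse it back in
        return json.loads(casts[(src, "json")](value))
    return value  # default rendering for everything else

class Hstore(dict):
    """Stand-in for a non-builtin type like hstore."""

casts = {("Hstore", "json"): lambda h: json.dumps(dict(h))}
assert datum_to_jsonb(Hstore(a="1"), casts) == {"a": "1"}
assert datum_to_jsonb(42, casts) == 42
```

Tom's objection below is visible in the two branches: when both casts exist, which one fires is a silent policy decision rather than anything the type author spelled out.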
Re: [HACKERS] Tweaking Foreign Keys for larger tables
Simon Riggs wrote: On 5 November 2014 21:15, Peter Eisentraut pete...@gmx.net wrote: ON DELETE IGNORE ON UPDATE IGNORE If we allow this specification then the FK is one way - we check the existence of a row in the referenced table, but there is no need for a trigger on the referenced table to enforce an action on delete or update, so no need to lock the referenced table when adding FKs. Are you worried about locking the table at all, or about having to lock many rows? This is useful for smaller, highly referenced tables that don't change much, if ever. In that case the need for correctness thru locking is minimal. If we do lock it will cause very high multixact traffic, so that is worth avoiding alone. This seems like a can of worms to me. How about the ability to mark a table READ ONLY, so that insert/update/delete operations on it raise an error? For such tables, you can just assume that tuples never go away, which can help optimize some ri_triggers.c queries by doing plain SELECT, not SELECT FOR KEY SHARE. If you later need to add rows to the table, you set it READ WRITE, and then ri_triggers.c automatically start using FOR KEY SHARE; add/modify to your liking, then set READ ONLY again. So you incur the cost of tuple locking only while you have the table open for writes. This way we don't get into the mess of reasoning about foreign keys that might be violated some of the time. There's a side effect of tables being READ ONLY which is that tuple freezing can be optimized as well. I vaguely recall we have discussed this. It's something like SET READ ONLY, then freeze it, which sets its relfrozenxid to 0 or maybe FrozenXid; vacuum knows it can ignore the table for freezing purposes. When SET READ WRITE, relfrozenxid jumps to RecentXmin. 
-- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
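Álvaro's ri_triggers.c point can be illustrated by sketching how the RI check query would be chosen. The FOR KEY SHARE form mirrors the kind of query ri_triggers.c issues today; the plain-SELECT branch is the proposed READ ONLY optimization, and the function itself is hypothetical:

```python
def ri_check_query(pk_table_read_only):
    # Against a table marked READ ONLY, referenced rows cannot disappear,
    # so the existence check needs no row lock; otherwise the check must
    # take FOR KEY SHARE to keep the row from being deleted under us.
    lock = "" if pk_table_read_only else " FOR KEY SHARE OF x"
    return "SELECT 1 FROM ONLY pk_table x WHERE key = $1" + lock

assert "FOR KEY SHARE" not in ri_check_query(True)
assert ri_check_query(False).endswith("FOR KEY SHARE OF x")
```

The win is that the plain SELECT generates no multixact traffic at all, which is exactly the overhead Simon wanted to avoid.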
Re: [HACKERS] json, jsonb, and casts
Andrew Dunstan andrew.duns...@pgexperts.com writes: In 9.3 we changed the way json generating functions worked by taking account of cast functions to json from non-builtin types, such as hstore. In 9.5 I am proposing to provide similar functionality for jsonb. The patch actually takes account of cast functions to both jsonb and json (with jsonb preferred). If there is a cast to jsonb, we use it and then merge the result into the jsonb being accumulated. If there is just a cast to json, we use it, and then parse that directly into the result datum. It was arguably a bit of an oversight not to take account of casts to jsonb in 9.4 in datum_to_json(). So I'm thinking of rolling into this patch changes to json.c::datum_to_json() and friends to take analogous account of casts to jsonb (i.e. call the cast function, turn the resulting jsonb into a cstring and append it to the result). Meh. This leaves it very ambiguous which cast function would be applied if both are available. I think it's overcomplicated anyway. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: Log inability to lock pages during vacuum
On 10/29/14, 11:49 AM, Jim Nasby wrote: On 10/21/14, 6:05 PM, Tom Lane wrote: Jim Nasby jim.na...@bluetreble.com writes: - What happens if we run out of space to remember skipped blocks? You forget some, and are no worse off than today. (This might be an event worthy of logging, if the array is large enough that we don't expect it to happen often ...) Makes sense. I'll see if there's some reasonable way to retry pages when the array fills up. I'll make the array 2k in size; that allows for 512 pages without spending a bunch of memory. Attached is a patch for this. It also adds logging of unobtainable cleanup locks, and refactors scanning a page for vacuum into its own function. Anyone reviewing this might want to look at https://github.com/decibel/postgres/commit/69ab22f703d577cbb3d8036e4e42563977bcf74b, which is the refactor with no whitespace changes. I've verified this works correctly by connecting to a backend with gdb and halting it with a page pinned. Both vacuum and vacuum freeze on that table do what's expected, but I also get this warning (which AFAICT is a false positive): decibel@decina.local=# vacuum verbose i; INFO: vacuuming public.i INFO: i: found 0 removable, 399774 nonremovable row versions in 1769 out of 1770 pages DETAIL: 20 dead row versions cannot be removed yet. There were 0 unused item pointers. 0 pages are entirely empty. Retried cleanup lock on 0 pages, retry failed on 1, skipped retry on 0. CPU 0.00s/0.06u sec elapsed 12.89 sec. WARNING: buffer refcount leak: [105] (rel=base/16384/16385, blockNum=0, flags=0x106, refcount=2 1) VACUUM I am doing a simple static allocation of retry_pages[]; my understanding is that it will only exist for the duration of this function so it's OK. If not I'll palloc it. If it is OK then I'll do the same for the freeze array. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! 
http://BlueTreble.com

From 1752751903a8d51b7b3b618072b6b0687f9f141c Mon Sep 17 00:00:00 2001
From: Jim Nasby jim.na...@bluetreble.com
Date: Thu, 6 Nov 2014 14:42:52 -0600
Subject: [PATCH] Vacuum cleanup lock retry

This patch will retry failed attempts to obtain the cleanup lock on a
buffer. It remembers failed block numbers in an array and retries after
vacuuming the relation. The array is currently fixed at 512 entries;
additional lock failures will not be re-attempted.

This patch also adds counters to report on failures, as well as
refactoring the guts of page vacuum scans into its own function.
---
 src/backend/commands/vacuumlazy.c | 964 +-
 1 file changed, 541 insertions(+), 423 deletions(-)

diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 3778d9d..240113f 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -96,6 +96,14 @@
  */
 #define SKIP_PAGES_THRESHOLD ((BlockNumber) 32)

+/*
+ * Instead of blindly skipping pages that we can't immediately acquire a
+ * cleanup lock for (assuming we're not freezing), we keep a list of pages we
+ * initially skipped, up to VACUUM_MAX_RETRY_PAGES. We retry those pages at the
+ * end of vacuuming.
+ */
+#define VACUUM_MAX_RETRY_PAGES 512
+
 typedef struct LVRelStats
 {
 	/* hasindex = true means two-pass strategy; false means one-pass */
@@ -143,6 +151,10 @@ static void lazy_vacuum_index(Relation indrel,
 static void lazy_cleanup_index(Relation indrel,
 				   IndexBulkDeleteResult *stats,
 				   LVRelStats *vacrelstats);
+static void lazy_scan_page(Relation onerel, LVRelStats *vacrelstats,
+	BlockNumber blkno, Buffer buf, Buffer vmbuffer, xl_heap_freeze_tuple *frozen,
+	int nindexes, bool all_visible_according_to_vm,
+	BlockNumber *empty_pages, BlockNumber *vacuumed_pages, double *nunused);
 static int	lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 				 int tupindex, LVRelStats *vacrelstats, Buffer *vmbuffer);
 static void lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats);
@@ -422,13 +434,15 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
 {
 	BlockNumber nblocks,
 				blkno;
-	HeapTupleData tuple;
 	char	   *relname;
 	BlockNumber empty_pages,
-				vacuumed_pages;
-	double		num_tuples,
-				tups_vacuumed,
-				nkeep,
+				vacuumed_pages,
+				retry_pages[VACUUM_MAX_RETRY_PAGES];
+	int			retry_pages_insert_ptr;
+	double		retry_page_count,
+				retry_fail_count,
+				retry_pages_skipped,
+				cleanup_lock_waits,
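Independent of the C details, the retry strategy the patch implements is easy to sketch: remember blocks whose cleanup lock could not be taken, up to a fixed limit, and retry them once after the main scan. Here try_cleanup_lock is a stand-in for ConditionalLockBufferForCleanup(), and everything else is illustrative:

```python
VACUUM_MAX_RETRY_PAGES = 512   # mirrors the constant in the patch

def vacuum_with_retry(nblocks, try_cleanup_lock):
    """try_cleanup_lock(blkno) -> bool; returns (pages vacuumed on retry,
    pages that failed the retry, pages skipped because the array filled)."""
    retry, overflowed = [], 0
    for blkno in range(nblocks):
        if not try_cleanup_lock(blkno):
            if len(retry) < VACUUM_MAX_RETRY_PAGES:
                retry.append(blkno)      # remember the block and move on
            else:
                overflowed += 1          # array full: skip it, as today
    # second pass: one more attempt at each remembered block
    retried_ok = [b for b in retry if try_cleanup_lock(b)]
    failed = len(retry) - len(retried_ok)
    return retried_ok, failed, overflowed

# block 7 is pinned during the first pass but released before the retry
pins = {7: [False, True]}   # per-block sequence of lock outcomes
ok, failed, over = vacuum_with_retry(
    10, lambda b: pins[b].pop(0) if b in pins and pins[b] else True)
assert ok == [7] and failed == 0 and over == 0
```

The overflow counter corresponds to Tom's "you forget some, and are no worse off than today" case: past 512 remembered blocks the behavior degrades to the current skip-and-continue.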
Re: [HACKERS] Tweaking Foreign Keys for larger tables
Alvaro Herrera alvhe...@2ndquadrant.com wrote: Simon Riggs wrote: On 5 November 2014 21:15, Peter Eisentraut pete...@gmx.net wrote: ON DELETE IGNORE ON UPDATE IGNORE If we allow this specification then the FK is one way - we check the existence of a row in the referenced table, but there is no need for a trigger on the referenced table to enforce an action on delete or update, so no need to lock the referenced table when adding FKs. Are you worried about locking the table at all, or about having to lock many rows? This is useful for smaller, highly referenced tables that don't change much, if ever. In that case the need for correctness thru locking is minimal. If we do lock it will cause very high multixact traffic, so that is worth avoiding alone. This seems like a can of worms to me. How about the ability to mark a table READ ONLY, so that insert/update/delete operations on it raise an error? For such tables, you can just assume that tuples never go away, which can help optimize some ri_triggers.c queries by doing plain SELECT, not SELECT FOR KEY SHARE. If you later need to add rows to the table, you set it READ WRITE, and then ri_triggers.c automatically start using FOR KEY SHARE; add/modify to your liking, then set READ ONLY again. So you incur the cost of tuple locking only while you have the table open for writes. This way we don't get into the mess of reasoning about foreign keys that might be violated some of the time. On its face, that sounds more promising to me. There's a side effect of tables being READ ONLY which is that tuple freezing can be optimized as well. I vaguely recall we have discussed this. It's something like SET READ ONLY, then freeze it, which sets its relfrozenxid to 0 or maybe FrozenXid; vacuum knows it can ignore the table for freezing purposes. When SET READ WRITE, relfrozenxid jumps to RecentXmin. 
It could also allow a (potentially large) optimization for serializable transactions -- there is no need to take any predicate locks on a table or its indexes if it is read only. To safely transition a table from read only to read write you would need at least two flags (similar in some ways to indisvalid and indisready) -- one to say whether any of these read only optimizations are allowed, and another, which actually allows writes, that would only be set after all transactions which might have seen the read only state have completed. Or that could be done with a char column with three states. So on transition to read only you would flag it as non-writable, and after all transactions which might have seen it in a writable state complete, you flag it as allowing read only optimizations. To transition to read write you disable the optimizations first and wait before actually flagging it as read write. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
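Kevin's three-state transition could be modeled as follows. The state names are invented, and wait_for_old_snapshots stands in for whatever barrier guarantees that every transaction has observed the intermediate state before the final one is entered:

```python
# Three-state flag sketched from the description above; not real catalog flags.
READ_WRITE, NO_WRITES, RO_OPTIMIZED = "w", "n", "r"

class Table:
    def __init__(self):
        self.state = READ_WRITE

    def set_read_only(self, wait_for_old_snapshots):
        self.state = NO_WRITES            # writes now error out
        wait_for_old_snapshots()          # everyone must see NO_WRITES first
        self.state = RO_OPTIMIZED         # now safe to skip predicate locks,
                                          # FOR KEY SHARE, freezing, etc.

    def set_read_write(self, wait_for_old_snapshots):
        self.state = NO_WRITES            # disable the optimizations first
        wait_for_old_snapshots()          # no one may still be relying on them
        self.state = READ_WRITE

t = Table()
t.set_read_only(lambda: None)
assert t.state == RO_OPTIMIZED
t.set_read_write(lambda: None)
assert t.state == READ_WRITE
```

The key property in both directions is the same: the middle state is entered first and held until no transaction can still be acting on the old state, which is what makes the optimizations safe to turn on or off.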
Re: [HACKERS] [BUGS] BUG #11867: Strange behaviour with composite types after resetting database tablespace
On 11/4/14, 10:45 AM, Tom Lane wrote: So it's safe as things stand; but this seems a bit, um, rickety. Should we insert a DropRelFileNodesAllBuffers call into ATExecSetTableSpace to make it safer? It's kind of annoying to have to scan the buffer pool twice, but I'm afraid that sometime in the future somebody will try to get clever about optimizing away the end-of-transaction buffer pool scan, and break this case. Or maybe I'm just worrying too much. We could possibly remember what relations have been moved to different tablespaces in a transaction and avoid the call in that case, but that seems rather silly. If there's any non-trivial actual data involved then presumably the buffer scan won't be noticed amidst all the IO, so this would only be an issue if you're playing games with mostly empty tables, and only then if you've got really large shared buffers. I can't actually come up with any real use case for that. So +1 for doing the safe thing... -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: Log inability to lock pages during vacuum
On 2014-11-06 14:55:37 -0600, Jim Nasby wrote: On 10/29/14, 11:49 AM, Jim Nasby wrote: On 10/21/14, 6:05 PM, Tom Lane wrote: Jim Nasby jim.na...@bluetreble.com writes: - What happens if we run out of space to remember skipped blocks? You forget some, and are no worse off than today. (This might be an event worthy of logging, if the array is large enough that we don't expect it to happen often ...) Makes sense. I'll see if there's some reasonable way to retry pages when the array fills up. I'll make the array 2k in size; that allows for 512 pages without spending a bunch of memory. Attached is a patch for this. It also adds logging of unobtainable cleanup locks, and refactors scanning a page for vacuum into its own function. Anyone reviewing this might want to look at https://github.com/decibel/postgres/commit/69ab22f703d577cbb3d8036e4e42563977bcf74b, which is the refactor with no whitespace changes. I've verified this works correctly by connecting to a backend with gdb and halting it with a page pinned. Both vacuum and vacuum freeze on that table do what's expected, but I also get this warning (which AFAICT is a false positive): I think the retry logic is a largely pointless complication of already complex enough code. You're fixing a problem for which there is absolutely no evidence of its existence. Yes, this happens occasionally. But it's going to be so absolutely minor in comparison to just about every other source of bloat. So, I'm pretty strongly against retrying. I could live with adding logging of the number of pages skipped due to not being able to acquire the cleanup lock. I don't think that's going to help many people, but the cost is pretty low. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] [BUGS] ltree::text not immutable?
On 11/5/14, 7:38 PM, Robert Haas wrote: Attached is a complete patch along these lines. As I suggested earlier, this just makes the relevant changes in ltree--1.0.sql and pg_trgm--1.1.sql without bumping their extension version numbers, since it doesn't seem important enough to justify a version bump. I don't understand why you went to all the trouble of building a versioning system for extensions if you're not going to use it. I was about to dismiss this comment since I don't see any real need for a version bump here, except that AFAIK there's no way to upgrade an extension without bumping the version, other than resorting to hackery. So I think this does need to be an upgrade. :( -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] BRIN indexes - TRAP: BadArgument
Jeff Janes wrote: On Wed, Nov 5, 2014 at 12:54 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: Thanks for the updated patch. Now when I run the test program (version with better error reporting attached), it runs fine until I open a psql session and issue: reindex table foo; Interesting. This was a more general issue actually -- if you dropped the index at that point and created it again, the resulting index would also be corrupt in the same way. Inspecting with the supplied pageinspect functions made the situation pretty obvious. The old code was skipping page ranges in which it could not find any tuples, but that's bogus and inefficient. I changed an if into a loop that inserts intermediary tuples, if any are needed. I cannot reproduce that problem anymore. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services brin-24.patch.gz Description: application/gzip
Re: [HACKERS] group locking: incomplete patch, just for discussion
On 11/5/14, 8:26 PM, Robert Haas wrote: rhaas=# create table foo (a int); CREATE TABLE rhaas=# select test_group_locking('1.0:start,2.0:start,1.0:lock:AccessExclusiveLock:foo,2.0:lock:AccessExclusiveLock:foo'); NOTICE: starting worker 1.0 NOTICE: starting worker 2.0 NOTICE: instructing worker 1.0 to acquire AccessExclusiveLock on relation with OID 16387 NOTICE: instructing worker 2.0 to acquire AccessExclusiveLock on relation with OID 16387 Damn that's cool to see! :) FWIW, some creative use of a composite type, plpgsql and string_to_array() would have saved a bunch of parsing code... it's surprisingly easy to parse stuff like this using string_to_array(). -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
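For comparison, the split-on-delimiters parsing Jim suggests doing with string_to_array() takes only a few lines in Python; the spec grammar here is inferred from the example in the message:

```python
def parse_spec(spec):
    """Parse "1.0:start,1.0:lock:AccessExclusiveLock:foo" style strings,
    the same way string_to_array(spec, ',') and a second split would in SQL."""
    steps = []
    for item in spec.split(","):
        worker, action, *args = item.split(":")
        steps.append((worker, action, args))
    return steps

steps = parse_spec("1.0:start,1.0:lock:AccessExclusiveLock:foo")
assert steps == [("1.0", "start", []),
                 ("1.0", "lock", ["AccessExclusiveLock", "foo"])]
```

The same two-level split (commas, then colons) is all the test harness's command language appears to need.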
Re: [HACKERS] Tweaking Foreign Keys for larger tables
On 11/6/14, 11:49 AM, David G Johnston wrote: Constraint would add a statement-level after trigger for insert, update, delete and truncate, which issues a relcache invalidation if the state was marked valid. Marked as deferrable initially deferred. I don't think you'd need to invalidate on insert, Why? Since the FK is not enforced there is no guarantee that what you just inserted is valid I'm talking about the referenced (aka 'parent') table, not the referring table. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
Re: [HACKERS] [BUGS] ltree::text not immutable?
Jim Nasby jim.na...@bluetreble.com writes: On 11/5/14, 7:38 PM, Robert Haas wrote: I don't understand why you went to all the trouble of building a versioning system for extensions if you're not going to use it. I was about to dismiss this comment since I don't see any real need for a version bump here, except that AFAIK there's no way to upgrade an extension without bumping the version, other than resorting to hackery. Well, the one or two users who actually care can fix the problem with a manual ALTER FUNCTION. I'm not sure it's worth making everyone else jump through update hoops. If it hadn't taken us years to even notice the issue, I might be more willing to expend work here, but I really doubt the audience is larger than one or two people ... regards, tom lane
Re: [HACKERS] group locking: incomplete patch, just for discussion
On Fri, Oct 31, 2014 at 9:07 AM, Andres Freund and...@2ndquadrant.com wrote: Maybe we can, as a first step, make those edges in the lock graph visible to the deadlock detector? It's pretty clear that undetected deadlocks aren't ok, but detectable deadlocks in a couple corner cases might be acceptable. An interesting point came to mind this morning on this topic, while discussing this issue with Amit. Suppose process A1 holds a lock on some object and process A2, a member of the same locking group, waits for a lock on that same object. It follows that there must exist a process B which either *holds* or *awaits* a lock which conflicts with the one requested by A2; otherwise, A2 would have been granted the lock at once. If B *holds* the conflicting lock, it may eventually release it; but if B *awaits* the conflicting lock, we have a soft deadlock, because A2 waits for B, which waits for A1, which we presume to wait for A2.[1] The logical consequence of this is that, if a process seeks to acquire a lock which is already held (in any mode) by a fellow group-member, it had better jump over the entire wait queue and acquire the lock immediately if no conflicting locks have already been granted. If it fails to do this, it immediately creates a soft deadlock. What if there *are* conflicting locks already granted? In that case, things are more complicated. We could, for example, have this situation: A1 holds AccessShareLock, B holds ExclusiveLock, C1 awaits ShareLock, and then (following it in the wait queue) A2 awaits RowExclusiveLock. This is not a deadlock, because B can finish, then C1 can get the lock, then C1 can finish, then A2 can get the lock. But if C2 now waits for some member of group A, then we have a soft deadlock between group A and group C, which can be resolved by moving A2's request ahead of C1.
(This example is relatively realistic, too: a lock upgrade from AccessShareLock to RowExclusiveLock is probably the most common type of lock upgrade, for reasons that are hopefully obvious.) In practice, I suspect it's best to take a more aggressive approach and have A2 jump straight to the head of the line right out of the chute. Letting some parallel workers make progress while holding up others out of a notion of locking fairness is probably stupid, because they're probably dependent on each other in some way that makes it useless for one to make progress while the others don't. For the same reason, I'd argue that we ought to wait until all pending lock requests from a given group can be satisfied before granting any of them. In fact, I suspect it's best in general for the locking machinery and the deadlock detector to view several simultaneous, pending lock requests as if they were a single request conflicting with everything that would conflict with any of the requested modes. Consider the situation where there are 10 lock waiters, 5 in each of two parallel groups. The deadlock detector could spend a good deal of time and energy trying various reorderings of the lock queue, but in practice it seems highly unlikely that it's sensible to do anything other than put all the locks from one group first and all of the locks from the other group afterwards, regardless of whether the processes want the same mode or different modes. Most of the other orderings will either be equivalent to one of those orderings (because commuting two non-conflicting locks in the wait queue has no effect) or soft deadlocks (because they interleave a conflicting lock request between two requests from the same group). And even if you can find some corner case where neither of those things is the case, it's probably still not useful to let half the lock group run and leave the other half blocked.
What will probably happen is that some internal queue will pretty quickly either fill up or drain completely, and the process filling it, or draining it, will wait. Now we're no better off than if we'd waited to grant the lock in the first place, and we're now holding more locks that can block other things or cause deadlocks. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company [1] Here I'm assuming, as in previous discussion, that we regard all members of a parallel group as mutually waiting for each other, on the theory that the parallel computation won't complete until all workers complete and, therefore, the members of the group are either already waiting for each other or eventually will be. While this might not be true in every case, I believe it's a good (and generally valid) simplifying assumption.
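The footnote's simplifying assumption (group members mutually wait for each other) amounts to collapsing each parallel group to a single node in the wait-for graph, after which an ordinary cycle check finds the soft deadlock from the A1/A2/B example. A toy sketch, nothing like the real deadlock detector:

```python
def has_deadlock(waits_for, groups):
    """waits_for maps a process to the set of processes it waits on; each set
    in groups is a parallel group treated as one blocking unit. Illustrative
    model only, not lock.c."""
    owner = {}
    for i, group in enumerate(groups):
        for proc in group:
            owner[proc] = "group%d" % i

    def node(p):
        return owner.get(p, p)   # collapse group members to one node

    edges = {}
    for proc, targets in waits_for.items():
        for t in targets:
            if node(proc) != node(t):   # waits within one group are not cycles
                edges.setdefault(node(proc), set()).add(node(t))

    def reach(start, goal, seen):
        return any(n == goal or (n not in seen and reach(n, goal, seen | {n}))
                   for n in edges.get(start, ()))

    return any(reach(n, n, set()) for n in edges)

# A2 waits on B, and B waits on A1, a member of A2's group: collapsing the
# group turns this into group0 -> B -> group0, the soft deadlock described.
assert has_deadlock({"A2": {"B"}, "B": {"A1"}}, [{"A1", "A2"}]) is True
assert has_deadlock({"A2": {"B"}}, [{"A1", "A2"}]) is False
```

This also shows why Robert's "treat the group's requests as a single request" framing is natural: once the group is one node, queue reorderings that interleave a conflicting request inside the group are exactly the ones that create a cycle.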
Re: [HACKERS] json, jsonb, and casts
On 11/06/2014 03:58 PM, Tom Lane wrote: Andrew Dunstan andrew.duns...@pgexperts.com writes: In 9.3 we changed the way json generating functions worked by taking account of cast functions to json from non-builtin types, such as hstore. In 9.5 I am proposing to provide similar functionality for jsonb. The patch actually takes account of cast functions to both jsonb and json (with jsonb preferred). If there is a cast to jsonb, we use it and then merge the result into the jsonb being accumulated. If there is just a cast to json, we use it, and then parse that directly into the result datum. It was arguably a bit of an oversight not to take account of casts to jsonb in 9.4 in datum_to_json(). So I'm thinking of rolling into this patch changes to json.c::datum_to_json() and friends to take analogous account of casts to jsonb (i.e. call the cast function, turn the resulting jsonb into a cstring and append it to the result). Meh. This leaves it very ambiguous which cast function would be applied if both are available. I think it's overcomplicated anyway. OK, then we can do one of these: * just honor casts to json, whether generating json or jsonb, or * just honor casts to json when generating json (as now) and just casts to jsonb when generating jsonb, ignoring the other casts in both cases. I can see a case for each, so I won't be too fussed either way. On balance I probably favor the first option, as it means you would only need to supply one cast function to have the type behave as you want, and functions providing casts to json are going to be a LOT easier to write. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] HINT: pg_hba.conf changed since last config reload
On 10/16/14 11:34 PM, Craig Ringer wrote: psql: FATAL: Peer authentication failed for user fred HINT: See the server error log for additional information. I think this is wrong for many reasons. I have never seen an authentication system that responds with, hey, what you just did didn't get you in, but the administrators are currently in the process of making a configuration change, so why don't you check that out. We don't know whether the user has access to the server log. They probably don't. Also, it is vastly more likely that the user really doesn't have access in the way they chose, so throwing in irrelevant hints will be distracting. Moreover, it will be confusing to regular users if this message sometimes shows up and sometimes doesn't, independent of their own state and actions. Finally, the fact that a configuration change is in progress is privileged information. Unprivileged users can deduce from the presence of this message that administrators are doing something, and possibly that they have done something wrong. I think it's fine to log a message in the server log if the pg_hba.conf file needs reloading. But the client shouldn't know about this at all.
Re: [HACKERS] split builtins.h to quote.h
On Fri, Nov 7, 2014 at 2:31 AM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: Julien Rouhaud wrote: I just reviewed this patch: * applies cleanly to master (d2b8a2c7) * all regression tests pass As it's only moving functions from builtins.h to quote.h and updating the impacted files, there's nothing special to add. It will probably break some user extensions using quote_* functions. Otherwise it looks ready to commit. I thought the consensus was that the SQL-callable function declarations should remain in builtins.h -- mainly so that quote.h does not need to include fmgr.h. Moving everything to quote.h is in line with what Tom and Robert suggested, builtins.h being a refuge for function declarations that have nowhere else to go. Suggestion from here: http://www.postgresql.org/message-id/ca+tgmozf3dkptua6ex6gxlnnd-nms-fbjcxroitwffh-+6y...@mail.gmail.com -- Michael
Re: [HACKERS] Proposal: Log inability to lock pages during vacuum
On Thu, Nov 6, 2014 at 9:30 PM, Andres Freund and...@2ndquadrant.com wrote: I think the retry logic is a largely pointless complication of already complex enough code. You're fixing a problem for which there is absolutely no evidence of its existence. Yes, this happens occasionally. But it's going to be so absolutely minor in comparison to just about every other source of bloat. I agree bloat isn't really a threat, but what about the relfrozenxid? If we skip even one page we don't get to advance it, and retrying could eliminate those skipped pages and allow us to avoid a vacuum freeze, which can be really painful. Of course that only works if you can be sure you haven't overflowed and forgotten any skipped pages, and if you don't find the page still pinned every time until you eventually give up on it. -- greg
Re: [HACKERS] Proposal: Log inability to lock pages during vacuum
On 11/6/14, 5:40 PM, Greg Stark wrote: On Thu, Nov 6, 2014 at 9:30 PM, Andres Freund and...@2ndquadrant.com wrote: I think the retry logic is a largely pointless complication of already complex enough code. You're fixing a problem for which there is absolutely no evidence of its existence. Yes, this happens occasionally. But it's going to be so absolutely minor in comparison to just about every other source of bloat. For some reason I don't have Andres' original email, so I'll reply here: I agree with you, and my original proposal was simply to log how many pages were skipped, but that was objected to. Simply logging this extra information would be a patch of a dozen lines or less. The problem right now is there's no way to actually obtain evidence that this is (or isn't) something to worry about, because we just silently skip pages. If we had any kind of tracking on this we could stop guessing. :( I agree bloat isn't really a threat, but what about the relfrozenxid? If we skip even one page we don't get to advance it and retrying could eliminate those skipped pages and allow us to avoid a vacuum freeze which can be really painful. Of course that only works if you can be sure you haven't overflowed and forgotten any skipped pages and if you don't find the page still pinned every time until you eventually give up on it. The overflow part shouldn't be that big a deal. Either we just bump the array size up some more, or worst-case we scan it whenever it fills (like we do when we fill vacrelstats->dead_tuples). But like I said above, I think this is already making a mountain out of a mole-hill, until we have evidence there's a real problem. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
Re: [HACKERS] tracking commit timestamps
On 11/01/2014 09:00 PM, Michael Paquier wrote: 1) It is untested and actually there is no direct use for it in core. 2) Pushing code that we know is dead is no good; that's a feature more or less defined as maybe-useful-but-we-are-not-sure-yet-what-to-do-with-it. 3) If you're going to re-use this API in BDR, which is a fork of Postgres.

I would like to emphasise that BDR is not really a fork at all, no more than any big topic branch would be a fork. BDR is two major parts:

- A collection of patches to core (commit timestamps/extradata, sequence AM, replication identifiers, logical decoding, DDL deparse, event triggers, etc). These are being progressively submitted to core, and are maintained as multiple feature branches plus a merged version; and

- An extension that uses core features and, where necessary, the additions to core to implement bi-directional logical replication.

Because of the time scales involved in getting things into core it's been necessary to *temporarily* get the 9.4-based feature branch into wider use so that it can be used to run the BDR extension, but if we can get the required features into core that need will go away. Event triggers and logical decoding were already merged in 9.4. If we can get things like commit timestamps, commit extradata / logical replication identifiers, the sequence access method, etc merged in 9.5 then it should be possible to do away with the need for the patches to core entirely and run BDR on top of stock 9.5. I'd be delighted if that were possible, as doing away with the patched 9.4 would get rid of a great deal of work and frustration on my part. Note that the BDR extension itself is PostgreSQL-licensed. Externally maintained extensions have been brought in-core before. It's a lot of code though, and I can't imagine that being a quick process.

You'd better complete this API in BDR by yourself and not bother core with that.
This argument would've prevented the inclusion of logical decoding, which is rapidly becoming the headline feature for 9.4, or at least shortly behind jsonb. Slony is being adapted to use it, multiple people are working on auditing systems based on it, and AFAIK EDB's xDB is being adapted to take advantage of it too. As development gets more complex and people attempt bigger features, the One Big Patch that adds a feature and an in-core user of the feature is not going to be a viable approach all the time. In my view it's already well past that, and some recent patches (like RLS) really should've been split up into patch series. If we want to avoid unreviewable monster-patches it will, IMO, be necessary to have progressive, iterative enhancement. That may sometimes mean that there's code in core that's only used by future yet-to-be-merged patches and/or by extensions. Of course it's desirable to have an in-tree user of the code wherever possible/practical - but sometimes it may *not* be possible or practical. It seems to me like the benefits of committing work in smaller, more reviewable chunks outweigh the benefits of merging multiple related but separate changes just so everything can have an immediate in-tree user. That's not to say that extradata must remain glued to commit timestamps. It might make more sense as a separate patch with an API to allow extensions to manipulate it directly, plus a dummy extension showing how it works, like we do with various hooks and with APIs like FDWs. However, just like the various hooks that we have, it *does* make sense to have something in-core that has no real-world in-core users. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] On partitioning
Hi,

From: ow...@postgresql.org On Behalf Of Robert Haas, Sent: Tuesday, October 28, 2014 9:20 PM: On Tue, Oct 28, 2014 at 6:06 AM, Andres Freund and...@2ndquadrant.com wrote: In my opinion we can reuse (some of) the existing logic for INHERITS to implement proper partitioning, but that should be an implementation detail.

Sure, that would be a sensible way to do it. I mostly care about not throwing out all the work that's been done on the planner and executor. Maybe you're thinking we'll eventually replace that with something better, which is fine, but I wouldn't underestimate the effort to make that happen. For example, I think it'd be sensible for the first patch to just add some new user-visible syntax with some additional catalog representation that doesn't actually do all that much yet. Then subsequent patches could use that additional metadata to optimize partition pruning, implement tuple routing, etc.

I mentioned upthread the possibility of resurrecting Itagaki-san's patch [1] to try to make things work in this direction. I would be willing to spend time on this. I see useful reviews of the patch by Robert [2] and Simon [3] from that time, but it wasn't pursued further. I think those reviews were valuable design input that IMHO would still be relevant. It seems the reviews suggested some significant changes to the proposed design. Of course, there are many other considerations discussed upthread that need to be addressed. Incorporating those changes and others, I think such an approach could be worthwhile. Thoughts? [1] https://wiki.postgresql.org/wiki/Table_partitioning#Active_Work_In_Progress [2] http://www.postgresql.org/message-id/aanlktikp-1_8b04eyik0sdf8ua5kmo64o8sorfbze...@mail.gmail.com [3] http://www.postgresql.org/message-id/1279196337.1735.9598.camel@ebony Thanks, Amit
Re: [HACKERS] What exactly is our CRC algorithm?
On Tue, Nov 4, 2014 at 3:17 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 10/27/2014 06:02 PM, Heikki Linnakangas wrote: I came up with the attached patches. They do three things: 1. Get rid of the 64-bit CRC code. It's not used for anything, and hasn't been for years, so it doesn't seem worth spending any effort to fix it. 2. Switch to CRC-32C (Castagnoli) for WAL and other places that don't need to remain compatible across major versions.

Will this change allow a database created before this commit to be started after this commit? With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] inherit support for foreign tables
Hello, I haven't fully caught up on this topic, but I tried this one. Here are separated patches. fdw-chk.patch - CHECK constraints on foreign tables fdw-inh.patch - table inheritance with foreign tables The latter has been created on top of [1]. [1] http://www.postgresql.org/message-id/540da168.3040...@lab.ntt.co.jp To be exact, it has been created on top of [1] and fdw-chk.patch.

I tried both patches on the current head; the newly added parameter to analyze_rel() prevented them from applying, but that seems easy to fix, and almost all of the rest applied cleanly. By the way, are these the result of simply splitting your last patch, foreign_inherit-v15.patch? http://www.postgresql.org/message-id/53feef94.6040...@lab.ntt.co.jp The results of applying the whole-in-one version and this split version seem to have many differences. Did you add other changes as well? Or do I understand this patch wrongly?

git diff --numstat 0_foreign_inherit_one 0_foreign_inherit_two
 5  51 contrib/file_fdw/file_fdw.c
 10 19 contrib/file_fdw/input/file_fdw.source
 18 71 contrib/file_fdw/output/file_fdw.source
 19 70 contrib/postgres_fdw/expected/postgres_fdw.out
 9  66 contrib/postgres_fdw/postgres_fdw.c
 12 35 contrib/postgres_fdw/sql/postgres_fdw.sql
 13 48 doc/src/sgml/fdwhandler.sgml
 39 39 doc/src/sgml/ref/alter_foreign_table.sgml
 4  3  doc/src/sgml/ref/create_foreign_table.sgml
 8  0  src/backend/catalog/heap.c
 7  3  src/backend/commands/analyze.c
 0  7  src/backend/commands/tablecmds.c
 1  22 src/backend/optimizer/plan/createplan.c
 7  7  src/backend/optimizer/prep/prepunion.c
 0  26 src/backend/optimizer/util/pathnode.c
 1  1  src/include/commands/vacuum.h
 0  7  src/include/foreign/fdwapi.h
 19 1  src/test/regress/expected/foreign_data.out
 9  2  src/test/regress/sql/foreign_data.sql

I noticed that the latter disallows TRUNCATE on inheritance trees that contain at least one child foreign table.
But I think it would be better to allow it, with the semantics that we quietly ignore the child foreign tables and apply the operation to the child plain tables, which is the same semantics as ALTER COLUMN SET STORAGE on such inheritance trees. Comments welcome.

Done. I've also revised the regression tests for both patches a bit. Patches attached. regards, -- Kyotaro Horiguchi NTT Open Source Software Center
Re: [HACKERS] Add generate_series(numeric, numeric)
On Tue, Oct 14, 2014 at 3:22 PM, Michael Paquier michael.paqu...@gmail.com wrote: On Tue, Oct 7, 2014 at 8:38 AM, Ali Akbar the.ap...@gmail.com wrote: 2014-10-06 22:51 GMT+07:00 Marti Raudsepp ma...@juffo.org: the one that tests values just before numeric overflow Actually I don't know if that's too useful. I think you should add a test case that causes an error to be thrown.

Actually I added the test case because in the code, when adding step into current for the last value, I expected it to overflow: /* increment current in preparation for next iteration */ add_var(fctx->current, fctx->step, fctx->current); where in the last calculation current is 9 * 10^131071. Plus 10^131071, it becomes 10^131072, which I expected to overflow the numeric type (in the docs, numeric's range is up to 131072 digits before the decimal point). In the attached patch, I narrowed the test case to produce a smaller result.

Well, as things stand now, the logic relies on cmp_var and the sign of the stop and current values. It is right that it would be better to check for overflow before going through the next iteration, and the cleanest and cheapest way to do so would be to move the call to make_result after add_var and save the result variable in the function context, or something similar, but honestly I am not sure it is worth the complication as it would mean some refactoring of what make_result actually already does. I looked at this patch again a bit and finished with the attached, adding an example in the docs, refining the error messages and improving the regression tests a bit. I have nothing more to comment, so I am marking this patch as ready for committer.

The patch looks good to me. Barring any objection I will commit the patch. Memo for me: CATALOG_VERSION_NO must be changed at the commit. Regards, -- Fujii Masao
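The iteration logic under discussion (cmp_var deciding the loop direction from the sign of step, add_var producing the next value) can be modeled in Python with Decimal. This is an illustrative sketch, not the actual numeric.c implementation:

```python
from decimal import Decimal

def generate_series(start, stop, step=Decimal(1)):
    """Model of generate_series(numeric, numeric[, numeric]): yields
    values from start to stop inclusive, direction given by step's sign."""
    if step == 0:
        raise ValueError("step size cannot equal zero")
    current, stop = Decimal(start), Decimal(stop)
    # cmp_var analogue: which comparison ends the loop depends on step's sign
    while (current <= stop) if step > 0 else (current >= stop):
        yield current
        # add_var(fctx->current, fctx->step, fctx->current) analogue;
        # in the C code this addition on the final, discarded iteration
        # is where the overflow being discussed can occur.
        current += step
```

Note that the final `current += step` computes one value past `stop` that is never returned, which is exactly why the near-overflow test case matters.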
Re: [HACKERS] Tweaking Foreign Keys for larger tables
On 6 November 2014 20:47, Alvaro Herrera alvhe...@2ndquadrant.com wrote: Simon Riggs wrote: ... In that case the need for correctness through locking is minimal. If we do lock it will cause very high multixact traffic, so that is worth avoiding on its own.

This seems like a can of worms to me. How about the ability to mark a table READ ONLY, so that insert/update/delete operations on it raise an error? For such tables, you can just assume that tuples never go away, which can help optimize some ri_triggers.c queries by doing plain SELECT, not SELECT FOR KEY SHARE. If you later need to add rows to the table, you set it READ WRITE, and then ri_triggers.c automatically starts using FOR KEY SHARE; add/modify to your liking, then set READ ONLY again. So you incur the cost of tuple locking only while you have the table open for writes.

How about we set the lock level on each Foreign Key like this:

[USING LOCK [lock level]]

where level is one of:

KEY - [FOR KEY SHARE] - default
ROW - [FOR SHARE]
TABLE SHARE - [ ]
TABLE EXCLUSIVE - [FOR TABLE EXCLUSIVE]

which introduces these new level descriptions:

TABLE SHARE - is the default behavior of SELECT
TABLE EXCLUSIVE - we lock the referenced table against all writes; this allows the table to be fully cached for use in speeding up checks [FOR TABLE EXCLUSIVE] - uses ShareRowExclusiveLock

The last level is like read-only tables apart from the fact that they can be written to when needed, but we optimize things on the assumption that such writes are very rare. We could also add read-only tables as well, but I don't see as much use for them.

Sounds like you'd spend a lot of time with ALTER TABLE as you turn it on and off. I'd like to be able to do that automatically as needed. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
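For concreteness, the per-constraint syntax Simon proposes might look something like this. This is hypothetical syntax from the proposal, not implemented in any PostgreSQL release, and the table and constraint names are invented for illustration:

```sql
-- Hypothetical: lock the small, rarely-updated referenced table
-- exclusively so FK checks can run against a fully cached copy.
ALTER TABLE order_line
  ADD CONSTRAINT order_line_order_fk
  FOREIGN KEY (order_id) REFERENCES orders (order_id)
  USING LOCK TABLE EXCLUSIVE;

-- Hypothetical: the default level, equivalent to today's
-- FOR KEY SHARE row locks taken by ri_triggers.c.
ALTER TABLE order_line
  ADD CONSTRAINT order_line_order_fk2
  FOREIGN KEY (order_id) REFERENCES orders (order_id)
  USING LOCK KEY;
```

The trade-off is per-constraint rather than per-table, which is what distinguishes this from Alvaro's READ ONLY table idea.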
Re: [HACKERS] Proposal: Log inability to lock pages during vacuum
On 2014-11-06 23:40:18 +, Greg Stark wrote: On Thu, Nov 6, 2014 at 9:30 PM, Andres Freund and...@2ndquadrant.com wrote: I think the retry logic is a largely pointless complication of already complex enough code. You're fixing a problem for which there is absolutely no evidence of its existence. Yes, this happens occasionally. But it's going to be so absolutely minor in comparison to just about every other source of bloat.

I agree bloat isn't really a threat, but what about relfrozenxid? If we skip even one page we don't get to advance it and retrying could eliminate those skipped pages and allow us to avoid a vacuum freeze which can be really painful. Of course that only works if you can be sure you haven't overflowed and forgotten any skipped pages and if you don't find the page still pinned every time until you eventually give up on it.

I don't buy this argument. Either you're constantly vacuuming the whole relation anyway - in which case we'll acquire the cleanup lock unconditionally - or we're doing partial vacuums via the visibility map anyway. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Proposal: Log inability to lock pages during vacuum
On 2014-11-06 19:03:20 -0600, Jim Nasby wrote: On 11/6/14, 5:40 PM, Greg Stark wrote: On Thu, Nov 6, 2014 at 9:30 PM, Andres Freund and...@2ndquadrant.com wrote: I think the retry logic is a largely pointless complication of already complex enough code. You're fixing a problem for which there is absolutely no evidence of its existence. Yes, this happens occasionally. But it's going to be so absolutely minor in comparison to just about every other source of bloat.

For some reason I don't have Andres' original email, so I'll reply here: I agree with you, and my original proposal was simply to log how many pages were skipped, but that was objected to. Simply logging this extra information would be a patch of a dozen lines or less.

The objection was that it's unnecessary complexity. So you made the patch a magnitude more complex *and* added logging? That doesn't make much sense.

The problem right now is there's no way to actually obtain evidence that this is (or isn't) something to worry about, because we just silently skip pages. If we had any kind of tracking on this we could stop guessing. :(

What's the worst consequence this could have? A couple pages not marked all-visible and not immediately cleaned up. That's not particularly harmful. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] What exactly is our CRC algorithm?
On 11/07/2014 07:08 AM, Amit Kapila wrote: On Tue, Nov 4, 2014 at 3:17 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 10/27/2014 06:02 PM, Heikki Linnakangas wrote: I came up with the attached patches. They do three things: 1. Get rid of the 64-bit CRC code. It's not used for anything, and hasn't been for years, so it doesn't seem worth spending any effort to fix it. 2. Switch to CRC-32C (Castagnoli) for WAL and other places that don't need to remain compatible across major versions.

Will this change allow a database created before this commit to be started after this commit?

No. You could use pg_resetxlog to fix the WAL, but I think at least the relmap files would still prevent you from starting up. You could use pg_upgrade. - Heikki
Re: [HACKERS] inherit support for foreign tables
(2014/08/28 18:00), Etsuro Fujita wrote: (2014/08/22 11:51), Noah Misch wrote: Today's ANALYZE VERBOSE messaging for former inheritance parents (tables with relhassubclass = true but no pg_inherits.inhparent links) is deceptive, and I welcome a fix to omit the spurious message. As defects go, this is quite minor. There's fundamentally no value in collecting inheritance tree statistics for such a parent, and no PostgreSQL command will do so. A credible alternative is to emit a second message indicating that we skipped the inheritance tree statistics after all, and why we skipped them. I'd like to address this by emitting the second message as shown below: A separate patch (analyze.patch) handles the former case in a similar way. I'll add analyze.patch to the upcoming CF as an independent item; it emits a second message indicating that we skipped the inheritance tree statistics and why we skipped them.

I did a review of the patch. There was no problem. I confirmed the following:
1. It applied cleanly and compiled without warnings or errors.
2. All regression tests passed.
3. The messages output by ANALYZE VERBOSE are as expected.

The following is the SQL I used to check the messages:

create table parent (id serial);
create table child (name text) inherits (parent);
ANALYZE VERBOSE parent;
drop table child;
ANALYZE VERBOSE parent;

Regards, -- Furuya Osamu