Re: [HACKERS] gaussian distribution pgbench -- splits v4
Hello Robert,

I wish to agree, but my interpretation of the previous code is that they were ignored before, so ISTM that we are stuck with keeping the same unfortunate behavior.

I don't agree. I'm not in a huge hurry to fix all the places where pgbench currently lacks error checks just because I don't have enough to do (hint: I do have enough to do), but when we're adding more complicated syntax in one particular place, bringing the error checks in that portion of the code up to scratch is an eminently sensible thing to do, and we should do it.

Ok. I'm in favor of that anyway. It is just that I was afraid that changing behavior, however poor the said behavior, could be a blocker.

Also, please stop changing the title of this thread every other post. It breaks threading for me (and anyone else using gmail), and that makes the thread hard to follow.

Sorry. It does not break my mailer, which relies on internal headers, but I'll try to be compatible with this gmail feature in the future.

-- Fabien.

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] delta relations in AFTER triggers
2014-07-28 19:27 GMT+02:00 Marti Raudsepp ma...@juffo.org:

On Mon, Jul 28, 2014 at 6:24 PM, Kevin Grittner kgri...@ymail.com wrote: Do you have some other suggestion? Keep in mind that it must allow the code which will *generate* the transition tables to know whether any of the attached triggers use a given transition table for the specific operation, regardless of the language of the trigger function.

You will need to access the pg_proc record of the trigger function anyway, so it's just a matter of coming up with syntax that makes sense, right? What I had in mind was that we could re-use the function argument declaration syntax. For instance, use the argmode specifier to declare OLD and NEW. That shouldn't cause grammar conflicts because the current OUT and INOUT aren't reserved keywords.

We could also re-use the refcursor type, which already has bindings in some PLs, if that's not too much overhead. That would make the behavior straightforward without introducing new constructs, plus you can pass them around between functions. Though admittedly it's annoying to integrate cursor results into queries. Something like:

CREATE FUNCTION trig(OLD old_rows refcursor, NEW new_rows refcursor) RETURNS trigger LANGUAGE plpgsql AS '...';

I dislike this proposal - it is strongly inconsistent with the current trigger design.

Regards, Pavel

Or maybe, if the grammar allows, we could spell out NEW TABLE, OLD TABLE, but that's redundant since you can already deduce that from the refcursor type. It could also be extended for different types, like tid[], and maybe record for the FOR EACH ROW variant (dunno if that can be made to work).

Regards, Marti
Re: [HACKERS] delta relations in AFTER triggers
On Tue, Jul 29, 2014 at 9:49 AM, Pavel Stehule pavel.steh...@gmail.com wrote: I dislike this proposal - it is strongly inconsistent with current trigger design

The real point I was trying to convey (in my previous email) is that these declarations should be part of the trigger *function*, not the function-to-table relationship. CREATE TRIGGER shouldn't be in the business of declaring new local variables for the trigger function. Whether we define new syntax for that or re-use the argument list is secondary.

But the inconsistency is deliberate: I find the current trigger API horrible. Magic variables... Text-only TG_ARGV for arguments... RETURNS trigger... No way to invoke trigger functions directly for testing. By not imitating past mistakes, maybe we can eventually arrive at a language that makes sense.

Regards, Marti
Re: [HACKERS] gaussian distribution pgbench -- splits v4
Hello Robert,

3. Similarly, I suggest that the use of gaussian or uniform be an error when argc < 6 OR argc > 6. I also suggest that the parenthesized distribution type be dropped from the error message in all cases.

I wish to agree, but my interpretation of the previous code is that they were ignored before, so ISTM that we are stuck with keeping the same unfortunate behavior.

I don't agree.

Attached B patch does turn incorrect setrandom syntax into errors instead of ignoring extra parameters. First A patch is repeated to help commitfest references.

-- Fabien.

diff --git a/contrib/pgbench/README b/contrib/pgbench/README
new file mode 100644
index 000..6881256
--- /dev/null
+++ b/contrib/pgbench/README
@@ -0,0 +1,5 @@
+# gaussian and exponential tests
+# with XXX as expo or gauss
+psql test < test-init.sql
+./pgbench -M prepared -f test-XXX-run.sql -t 100 -P 1 -n test
+psql test < test-XXX-check.sql
diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 4aa8a50..e07206a 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -98,6 +98,8 @@ static int pthread_join(pthread_t th, void **thread_return);
 #define LOG_STEP_SECONDS 5    /* seconds between log messages */
 #define DEFAULT_NXACTS 10     /* default nxacts */
 
+#define MIN_GAUSSIAN_THRESHOLD 2.0   /* minimum threshold for gauss */
+
 int nxacts = 0;      /* number of transactions per client */
 int duration = 0;    /* duration in seconds */
@@ -471,6 +473,76 @@ getrand(TState *thread, int64 min, int64 max)
     return min + (int64) ((max - min + 1) * pg_erand48(thread->random_state));
 }
 
+/*
+ * random number generator: exponential distribution from min to max inclusive.
+ * the threshold is so that the density of probability for the last cut-off max
+ * value is exp(-threshold).
+ */
+static int64
+getExponentialRand(TState *thread, int64 min, int64 max, double threshold)
+{
+    double cut, uniform, rand;
+
+    Assert(threshold > 0.0);
+    cut = exp(-threshold);
+    /* erand in [0, 1), uniform in (0, 1] */
+    uniform = 1.0 - pg_erand48(thread->random_state);
+    /*
+     * inner expression in (cut, 1] (if threshold > 0),
+     * rand in [0, 1)
+     */
+    Assert((1.0 - cut) != 0.0);
+    rand = -log(cut + (1.0 - cut) * uniform) / threshold;
+    /* return int64 random number within between min and max */
+    return min + (int64) ((max - min + 1) * rand);
+}
+
+/* random number generator: gaussian distribution from min to max inclusive */
+static int64
+getGaussianRand(TState *thread, int64 min, int64 max, double threshold)
+{
+    double stdev;
+    double rand;
+
+    /*
+     * Get user specified random number from this loop, with
+     * -threshold <= stdev < threshold
+     *
+     * This loop is executed until the number is in the expected range.
+     *
+     * As the minimum threshold is 2.0, the probability of looping is low:
+     * sqrt(-2 ln(r)) >= 2 => r <= e^{-2} ~ 0.135, then when taking the average
+     * sinus multiplier as 2/pi, we have a 8.6% looping probability in the
+     * worst case. For a 5.0 threshold value, the looping probability
+     * is about e^{-5} * 2 / pi ~ 0.43%.
+     */
+    do
+    {
+        /*
+         * pg_erand48 generates [0,1), but for the basic version of the
+         * Box-Muller transform the two uniformly distributed random numbers
+         * are expected in (0, 1] (see http://en.wikipedia.org/wiki/Box_muller)
+         */
+        double rand1 = 1.0 - pg_erand48(thread->random_state);
+        double rand2 = 1.0 - pg_erand48(thread->random_state);
+
+        /* Box-Muller basic form transform */
+        double var_sqrt = sqrt(-2.0 * log(rand1));
+        stdev = var_sqrt * sin(2.0 * M_PI * rand2);
+
+        /*
+         * we may try with cos, but there may be a bias induced if the previous
+         * value fails the test. To be on the safe side, let us try over.
+         */
+    }
+    while (stdev < -threshold || stdev >= threshold);
+
+    /* stdev is in [-threshold, threshold), normalization to [0,1) */
+    rand = (stdev + threshold) / (threshold * 2.0);
+
+    /* return int64 random number within between min and max */
+    return min + (int64) ((max - min + 1) * rand);
+}
+
 /* call PQexec() and exit() on failure */
 static void
 executeStatement(PGconn *con, const char *sql)
@@ -1319,6 +1391,7 @@ top:
             char *var;
             int64 min, max;
+            double threshold = 0;
             char res[64];
 
             if (*argv[2] == ':')
@@ -1364,11 +1437,11 @@ top:
             }
 
             /*
-             * getrand() needs to be able to subtract max from min and add one
-             * to the result without overflowing. Since we know max > min, we
-             * can detect overflow just by checking for a negative result. But
-             * we must check both that the subtraction doesn't overflow, and
-             * that adding one to the result doesn't overflow either.
+             * Generate random number functions need to be able to subtract
+             * max from min and add one to the result without overflowing.
+             * Since we know max > min, we can detect overflow just by checking
+             * for a negative result. But we must check both that the subtraction
+             * doesn't overflow, and that adding one to the
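As a sanity check of the sampling logic in the patch above, the gaussian generator can be transliterated into Python. This is my own sketch, not part of the patch: `gaussian_rand`, the bounds asserted at the end, and the use of Python's `random` module in place of pg_erand48 are all assumptions for illustration.

```python
import math
import random

def gaussian_rand(rng, lo, hi, threshold):
    """Pick an integer in [lo, hi] with a truncated-gaussian bias,
    mirroring the Box-Muller-plus-rejection scheme in getGaussianRand."""
    assert threshold >= 2.0  # MIN_GAUSSIAN_THRESHOLD in the patch
    while True:
        # Box-Muller needs uniforms in (0, 1]; random() yields [0, 1)
        r1 = 1.0 - rng.random()
        r2 = 1.0 - rng.random()
        stdev = math.sqrt(-2.0 * math.log(r1)) * math.sin(2.0 * math.pi * r2)
        if -threshold <= stdev < threshold:
            break  # reject and retry values outside the cut-off
    # normalize [-threshold, threshold) to [0, 1), then scale to [lo, hi]
    rand = (stdev + threshold) / (threshold * 2.0)
    return lo + int((hi - lo + 1) * rand)

rng = random.Random(42)
samples = [gaussian_rand(rng, 1, 100, 5.0) for _ in range(10000)]
print(min(samples) >= 1 and max(samples) <= 100)   # all within bounds
print(40 < sum(samples) / len(samples) < 60)       # mean near the middle
```

Running this confirms the two properties the patch's comments claim: every value stays inside [min, max], and the mass concentrates around the middle of the range.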
Re: [HACKERS] delta relations in AFTER triggers
2014-07-29 9:41 GMT+02:00 Marti Raudsepp ma...@juffo.org:

On Tue, Jul 29, 2014 at 9:49 AM, Pavel Stehule pavel.steh...@gmail.com wrote: I dislike this proposal - it is strongly inconsistent with current trigger design

The real point I was trying to convey (in my previous email) is that these declarations should be part of the trigger *function*, not the function-to-table relationship. CREATE TRIGGER shouldn't be in the business of declaring new local variables for the trigger function. Whether we define new syntax for that or re-use the argument list is secondary. But the inconsistency is deliberate, I find the current trigger API horrible. Magic variables... Text-only TG_ARGV for arguments... RETURNS trigger...

I don't like the RETURNS TRIGGER notation too much either - RETURNS void or RETURNS record would be much more natural. My dream is something like CREATE OR REPLACE TRIGGER FUNCTION trg() RETURNS RECORD, but it is only syntactic sugar - and I don't see any benefit that would justify implementing it.

No way to invoke trigger functions directly for testing.

That is a horrible idea. I can agree it is a limitation - but not a hard one - there is a simple workaround: move the code from the trigger into an auxiliary function. But the current design is simple and robust, with few possible user errors.

By not imitating past mistakes, maybe we can eventually arrive at a language that makes sense.

Sorry, I disagree. Whether this API is bad enough to warrant a redesign is subjective. More objective arguments: there is no performance issue and no security issue. I think the current design makes sense, so I don't see a reason to change it, nor why we should maintain two APIs. One last argument: if we change something, then we should use ANSI SQL syntax everywhere it is possible (when we don't gain any new special functionality).

Regards, Pavel

Regards, Marti
Re: [HACKERS] gaussian distribution pgbench -- splits v4
Attached B patch does turn incorrect setrandom syntax into errors instead of ignoring extra parameters. First A patch is repeated to help commitfest references.

Oops, I applied the change on the wrong part :-( Here is the change on part A, which checks setrandom syntax, and B for completeness.

-- Fabien.

diff --git a/contrib/pgbench/README b/contrib/pgbench/README
new file mode 100644
index 000..6881256
--- /dev/null
+++ b/contrib/pgbench/README
@@ -0,0 +1,5 @@
+# gaussian and exponential tests
+# with XXX as expo or gauss
+psql test < test-init.sql
+./pgbench -M prepared -f test-XXX-run.sql -t 100 -P 1 -n test
+psql test < test-XXX-check.sql
diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 4aa8a50..16e44bd 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -98,6 +98,8 @@ static int pthread_join(pthread_t th, void **thread_return);
 #define LOG_STEP_SECONDS 5    /* seconds between log messages */
 #define DEFAULT_NXACTS 10     /* default nxacts */
 
+#define MIN_GAUSSIAN_THRESHOLD 2.0   /* minimum threshold for gauss */
+
 int nxacts = 0;      /* number of transactions per client */
 int duration = 0;    /* duration in seconds */
@@ -471,6 +473,76 @@ getrand(TState *thread, int64 min, int64 max)
     return min + (int64) ((max - min + 1) * pg_erand48(thread->random_state));
 }
 
+/*
+ * random number generator: exponential distribution from min to max inclusive.
+ * the threshold is so that the density of probability for the last cut-off max
+ * value is exp(-threshold).
+ */
+static int64
+getExponentialRand(TState *thread, int64 min, int64 max, double threshold)
+{
+    double cut, uniform, rand;
+
+    Assert(threshold > 0.0);
+    cut = exp(-threshold);
+    /* erand in [0, 1), uniform in (0, 1] */
+    uniform = 1.0 - pg_erand48(thread->random_state);
+    /*
+     * inner expression in (cut, 1] (if threshold > 0),
+     * rand in [0, 1)
+     */
+    Assert((1.0 - cut) != 0.0);
+    rand = -log(cut + (1.0 - cut) * uniform) / threshold;
+    /* return int64 random number within between min and max */
+    return min + (int64) ((max - min + 1) * rand);
+}
+
+/* random number generator: gaussian distribution from min to max inclusive */
+static int64
+getGaussianRand(TState *thread, int64 min, int64 max, double threshold)
+{
+    double stdev;
+    double rand;
+
+    /*
+     * Get user specified random number from this loop, with
+     * -threshold <= stdev < threshold
+     *
+     * This loop is executed until the number is in the expected range.
+     *
+     * As the minimum threshold is 2.0, the probability of looping is low:
+     * sqrt(-2 ln(r)) >= 2 => r <= e^{-2} ~ 0.135, then when taking the average
+     * sinus multiplier as 2/pi, we have a 8.6% looping probability in the
+     * worst case. For a 5.0 threshold value, the looping probability
+     * is about e^{-5} * 2 / pi ~ 0.43%.
+     */
+    do
+    {
+        /*
+         * pg_erand48 generates [0,1), but for the basic version of the
+         * Box-Muller transform the two uniformly distributed random numbers
+         * are expected in (0, 1] (see http://en.wikipedia.org/wiki/Box_muller)
+         */
+        double rand1 = 1.0 - pg_erand48(thread->random_state);
+        double rand2 = 1.0 - pg_erand48(thread->random_state);
+
+        /* Box-Muller basic form transform */
+        double var_sqrt = sqrt(-2.0 * log(rand1));
+        stdev = var_sqrt * sin(2.0 * M_PI * rand2);
+
+        /*
+         * we may try with cos, but there may be a bias induced if the previous
+         * value fails the test. To be on the safe side, let us try over.
+         */
+    }
+    while (stdev < -threshold || stdev >= threshold);
+
+    /* stdev is in [-threshold, threshold), normalization to [0,1) */
+    rand = (stdev + threshold) / (threshold * 2.0);
+
+    /* return int64 random number within between min and max */
+    return min + (int64) ((max - min + 1) * rand);
+}
+
 /* call PQexec() and exit() on failure */
 static void
 executeStatement(PGconn *con, const char *sql)
@@ -1319,6 +1391,7 @@ top:
             char *var;
             int64 min, max;
+            double threshold = 0;
             char res[64];
 
             if (*argv[2] == ':')
@@ -1364,11 +1437,11 @@ top:
             }
 
             /*
-             * getrand() needs to be able to subtract max from min and add one
-             * to the result without overflowing. Since we know max > min, we
-             * can detect overflow just by checking for a negative result. But
-             * we must check both that the subtraction doesn't overflow, and
-             * that adding one to the result doesn't overflow either.
+             * Generate random number functions need to be able to subtract
+             * max from min and add one to the result without overflowing.
+             * Since we know max > min, we can detect overflow just by checking
+             * for a negative result. But we must check both that the subtraction
+             * doesn't overflow, and that adding one to the result doesn't overflow either.
              */
             if (max - min < 0 || (max - min) + 1 < 0)
             {
@@ -1377,10 +1450,64 @@ top:
                 return true;
             }
 
+            if (argc == 4 ||    /* uniform without or with "uniform" keyword */
+                (argc == 5 && pg_strcasecmp(argv[4], "uniform") == 0))
+
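The exponential generator in the patch can be sanity-checked the same way. Again, this is my own Python sketch, not patch code: `exponential_rand` and the bias check at the end are illustrative assumptions.

```python
import math
import random

def exponential_rand(rng, lo, hi, threshold):
    """Pick an integer in [lo, hi] biased exponentially toward lo,
    mirroring getExponentialRand: the density at the cut-off hi end
    is exp(-threshold) of the density at lo."""
    assert threshold > 0.0
    cut = math.exp(-threshold)
    uniform = 1.0 - rng.random()           # uniform in (0, 1]
    # cut + (1 - cut) * uniform lies in (cut, 1], so rand lands in [0, 1)
    rand = -math.log(cut + (1.0 - cut) * uniform) / threshold
    return lo + int((hi - lo + 1) * rand)

rng = random.Random(0)
samples = [exponential_rand(rng, 1, 100, 5.0) for _ in range(10000)]
print(min(samples) >= 1 and max(samples) <= 100)   # all within bounds
# exponential bias: values in the low end of the range dominate
low = sum(1 for s in samples if s <= 20)
print(low > len(samples) - low)
```

With threshold 5.0, roughly 63% of the draws should fall in the lowest fifth of the range, which the final check confirms.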
Re: [HACKERS] Is analyze_new_cluster.sh still useful?
On Wed, Jun 18, 2014 at 05:41:06PM +0200, Christoph Berg wrote: Hi, now that we have vacuumdb --all --analyze-in-stages in 9.4, wouldn't it make sense to get rid of the analyze_new_cluster.sh file which pg_upgrade writes? The net content is a single line which could as well be printed by pg_upgrade itself. Instead of a lengthy explanation of how to invoke that manually, there should be a short note and a pointer to some manual section. I think the chances of people reading that would even be increased.

I was not a big fan of keeping analyze_new_cluster.sh with one command in it, but it does maintain the same user API, so I guess that is why people wanted it kept.

Similarly, I don't really see the usefulness of delete_old_cluster.sh as a file, when rm -rf could just be presented on the console for the admin to execute by cut-and-paste.

Uh, that could be hard because delete_old_cluster.sh can also delete the old major-version-specific subdirectories in tablespaces, so I do think we need to keep that.

-- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
Re: [HACKERS] Is analyze_new_cluster.sh still useful?
On Fri, Jun 20, 2014 at 05:15:05PM +0200, Christoph Berg wrote: Another nitpick here: what pg_upgrade outputs doesn't even work on most systems; you need to ./analyze_new_cluster.sh or sh analyze_new_cluster.sh.

Well, the output is:

    Optimizer statistics are not transferred by pg_upgrade so,
    once you start the new server, consider running:
        analyze_new_cluster.sh

    Running this script will delete the old cluster's data files:
        delete_old_cluster.sh

It is not really telling you _how_ to run them. Would adding a ./ prefix help?
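Why the bare name fails comes down to PATH lookup: a command name without a directory component is searched for only in $PATH, which on most systems excludes the current directory, while a `./` prefix bypasses the search entirely. A quick illustration (the script below is a stand-in I create for the demo, not the real file pg_upgrade writes):

```python
import os
import shutil
import stat
import tempfile

# create a throwaway stand-in for the script pg_upgrade writes
d = tempfile.mkdtemp()
script = os.path.join(d, "analyze_new_cluster.sh")
with open(script, "w") as f:
    f.write("#!/bin/sh\necho vacuumdb --all --analyze-in-stages\n")
os.chmod(script, os.stat(script).st_mode | stat.S_IEXEC)
os.chdir(d)

# a bare name is resolved through PATH only; with a typical PATH that
# excludes ".", the hinted command cannot be found:
print(shutil.which("analyze_new_cluster.sh", path="/usr/bin:/bin"))

# a ./ prefix names the file directly, so no PATH search happens:
print(shutil.which("./analyze_new_cluster.sh"))
```

The first lookup returns None, the second finds the file, which is exactly the difference a `./` prefix in pg_upgrade's hint would make.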
Re: [HACKERS] pg_receivexlog add synchronous mode
I have improved the patch by making the following changes:

1. Since stream_stop() was redundant, the stream_stop() call at the time of WAL file closing was deleted.

2. Changed the timing of the flush decision, for source-code readability: the flush decision is now made before the continue-statement decision that follows sending the feedback message, rather than after receiving a message.

Regards, -- Furuya Osamu

pg_receivexlog-add-fsync-interval-v4.patch Description: pg_receivexlog-add-fsync-interval-v4.patch
Re: [HACKERS] Production block comparison facility
On 07/23/2014 05:14 PM, Michael Paquier wrote: On Tue, Jul 22, 2014 at 4:49 PM, Michael Paquier michael.paqu...@gmail.com wrote: Then, looking at the code, we would need to tweak XLogInsert for the WAL record construction to always do a FPW and to update XLogCheckBufferNeedsBackup. Then for the redo part, we would need to do some extra operations in the area of RestoreBackupBlock/RestoreBackupBlockContents, including masking operations before comparing the content of the FPW and the current page. Does that sound right?

I have spent some time digging more into this idea and finished with the patch attached, which adds a consistency check when a FPW is restored and applied on a given page. The consistency check is made of two phases:
- Apply a mask on the FPW and the current page to eliminate potential conflicts, like hint bits for example.
- Check that the FPW is consistent with the current page, i.e. the current page does not contain any new information that the FPW does not have. This is done by checking the masked portions of the FPW and the current page.

Also some more details:
- If an inconsistency is found, a WARNING is simply logged.
- The consistency check is done if the current page is not empty, and if the database has reached a consistent state.
- The page masking API is taken from the WAL replay patch that was submitted in CF1 and plugged in as an independent set of APIs.
- In the masking code, to determine whether a page is used by a sequence relation, SEQ_MAGIC as well as its opaque data structure are renamed and moved into sequence.h.
- To facilitate debugging and comparison, the masked FPW and current page are also converted into hex.

Things could be refactored and improved for sure, but this patch is already useful as-is, so I am going to add it to the next commitfest.

I don't understand how this works. A full-page image contains the new page contents *after* the WAL-logged operation. For example, in a heap insert, the full-page image contains the new tuple. How can you compare that with what's on the disk already?

ISTM you'd need to log two full-page images for every WAL record: a before image and an after image. Then you could do a lot of checking: 1. the before image should match what's on disk already; 2. the result after applying the WAL record should match the after image.

That would be more handy than the approach I used, where the page images are logged to a separate file. You wouldn't need to deal with any new files, as all the data is in the WAL. Verification would be done directly in the standby, with no need to run any extra programs.

- Heikki
Re: [HACKERS] Minmax indexes
On 07/10/2014 12:41 AM, Alvaro Herrera wrote: Heikki Linnakangas wrote: On 06/23/2014 08:07 PM, Alvaro Herrera wrote: I feel that the below would nevertheless be simpler: I wonder if it would be simpler to just always store the revmap pages in the beginning of the index, before any other pages. Finding the revmap page would then be just as easy as with a separate fork. When the table/index is extended so that a new revmap page is needed, move the existing page at that block out of the way. Locking needs some consideration, but I think it would be feasible and simpler than you have now.

Moving index items around is not easy, because you'd have to adjust the revmap to rewrite the item pointers.

Hmm. Two alternative schemes come to mind:

1. Move each index tuple off the page individually, updating the revmap while you do it, until the page is empty. Updating the revmap for a single index tuple isn't difficult; you have to do it anyway when an index tuple is replaced. (MMTuples don't contain a heap block number ATM, but IMHO they should, see below.)

2. Store the new block number of the page that you moved out of the way in the revmap page, and leave the revmap pointers unchanged. The revmap pointers can be updated later, lazily.

Both of those seem pretty straightforward.

The trouble I have with moving blocks around to make space is that it would cause the index to have periodic hiccups to make room for the new revmap pages. One nice property that these indexes are supposed to have is that the effect on insertion times should be pretty minimal. That would cease to be the case if we have to do your proposed block moves.

Approach 2 above is fairly quick, quick enough that no one would notice the hiccup. Moving the tuples individually (approach 1) would be slower.

ISTM that when the old tuple cannot be updated in-place, the new index tuple is inserted with mm_doinsert(), but the old tuple is never deleted.

It's deleted by the next vacuum.

Ah, I see. Vacuum reads the whole index, and builds an in-memory hash table that contains an ItemPointerData for every tuple in the index. Doesn't that require a lot of memory, for a large index? That might be acceptable - you ought to have plenty of RAM if you're pushing around multi-terabyte tables - but it would nevertheless be nice to not have a hard requirement for something as essential as vacuum.

I guess if you're expecting that pages_per_range=1 is a common case, yeah, it might become an issue eventually. Not sure, but I find it easier to think of the patch that way.

In any case, it would be nice to avoid the problem, even if it's not common. One idea I just had is to have a bit for each index tuple, which is set whenever the revmap no longer points to it. That way, vacuuming is much easier: just scan the index and delete all tuples having that bit set. The bit needs to be set atomically with the insertion of the new tuple, so why not just remove the old tuple right away? Wouldn't it be simpler to remove the old tuple atomically with inserting the new tuple and updating the revmap? Or at least mark the old tuple as deletable, so that vacuum can just delete it, without building the large hash table to determine that it's deletable.

Yes, it might be simpler, but it'd require dirtying more pages on insertions (and holding more page-level locks, for longer - not good for concurrent access).

I wouldn't worry much about the performance and concurrency of this operation. Remember that the majority of updates are expected to not have to update the index, otherwise the minmax index will degenerate quickly and performance will suck anyway. And even when updating the index is needed, in most cases the new tuple fits on the same page, after removing the old one. So the case where you have to insert a new index tuple, remove the old one (or mark it dead), and update the revmap to point to the new tuple, is rare.

I'm quite surprised by the use of LockTuple on the index tuples. I think the main reason for needing that is the fact that MMTuple doesn't store the heap (range) block number that the tuple points to: LockTuple is required to ensure that the tuple doesn't go away while a scan is following a pointer from the revmap to it. If the MMTuple contained the BlockNumber, a scan could check that and go back to the revmap if it doesn't match. Alternatively, you could keep the revmap page locked when you follow a pointer to the regular index page.

There's the intention that these accesses be kept as concurrent as possible; this is why we don't want to block the whole page. Locking individual TIDs is fine in this case (which is not in SELECT FOR UPDATE) because we can only lock a single tuple in any one index scan, so there's no unbounded growth of the lock table. I prefer not to have BlockNumbers in index tuples, because that would make them larger for not much gain. That data would mostly be redundant, and would be necessary
Re: [HACKERS] postgresql.auto.conf and reload
On Mon, Jul 28, 2014 at 11:33 PM, Fujii Masao masao.fu...@gmail.com wrote: There is another side effect of this patch. With the patch, when reloading the configuration file, the server cannot warn about an invalid setting value if it's not the last setting of the parameter. This may cause a problematic situation as follows.

1. ALTER SYSTEM SET work_mem TO '1024kB';
2. Reload the configuration file --- success.
3. Then, a user accidentally adds work_mem = '2048KB' into postgresql.conf. The setting value '2048KB' is invalid; the unit should be 'kB' instead of 'KB'.
4. Reload the configuration file --- success. The invalid setting is ignored because the setting of work_mem in postgresql.auto.conf is preferred, so the user cannot notice that postgresql.conf has an invalid setting.
5. A failover on a shared-disk HA configuration happens; then PostgreSQL fails to start up because of the invalid setting. When PostgreSQL starts up, the last setting is preferred, but all the settings are checked.

The reason is that during startup DataDir is not set by the time the server processes the config file, due to which we process the .auto.conf file in a second pass. I think ideally it should ignore the invalid setting in such a case as it does during reload; however, currently there is no good way to process .auto.conf when DataDir is not set, so we handle it separately in a second pass. Can we live with this issue?

If you are worried about the reload-success case, it will anyway fail later on if the user actually wants to set the value via postgresql.conf, because in such a case the user has to remove the setting made by ALTER SYSTEM using ALTER SYSTEM SET param_name TO DEFAULT. In case he doesn't have any such intention, then I think ignoring such invalid params is not a concerning issue.

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
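The five-step scenario above can be reduced to a toy model of the precedence rule: on reload only the last occurrence of each parameter is validated, while at startup every occurrence is checked. Everything here is an illustrative caricature — the unit check is reduced to a `'KB'` suffix test and none of this is PostgreSQL code.

```python
def reload_settings(entries):
    """Toy model of a GUC reload: for each parameter, only the LAST
    occurrence (postgresql.conf lines first, then postgresql.auto.conf)
    is validated and applied; earlier occurrences are never checked."""
    last = {}
    for name, value in entries:
        last[name] = value                  # later files/lines win
    applied, errors = {}, []
    for name, value in last.items():
        if value.endswith("KB"):            # caricature of the unit check:
            errors.append((name, value))    # 'KB' is invalid, 'kB' is valid
        else:
            applied[name] = value
    return applied, errors

# step 3: invalid unit sits in postgresql.conf;
# step 1: ALTER SYSTEM wrote the override into postgresql.auto.conf
entries = [("work_mem", "2048KB"), ("work_mem", "1024kB")]

applied, errors = reload_settings(entries)
print(applied, errors)   # reload succeeds; the bad line was never checked

# at startup (step 5) every entry is validated, so the bad line surfaces:
startup_errors = [(n, v) for n, v in entries if v.endswith("KB")]
print(startup_errors)
```

The reload reports no errors while startup does, which is exactly the asymmetry Fujii's scenario complains about.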
Re: [HACKERS] Use unique index for longer pathkeys.
On Mon, Jul 28, 2014 at 3:17 PM, Kyotaro HORIGUCHI horiguchi.kyot...@lab.ntt.co.jp wrote: Now drop the i_t1_pkey_1 and check the query plan again.

drop index i_t1_pkey_1;
explain (costs off, analyze off) select * from t,t1 where t.a=t1.a order by t1.a,t1.b,t1.c,t1.d;

              QUERY PLAN
 Sort
   Sort Key: t.a, t1.b, t1.c, t1.d
   ->  Merge Join
         Merge Cond: (t.a = t1.a)
         ->  Index Scan using i_t_pkey on t
         ->  Index Scan using i_t1_pkey_2 on t1
(6 rows)

Can't the above plan eliminate the Sort Key even after dropping index (i_t1_pkey_1)?

My patch doesn't, since there is no longer a 'common primary pathkeys' in this query. Perhaps the query doesn't allow the sort to be eliminated. Since a is no longer a pkey, t1 can have duplicate rows for the same a, so the joined relation also may have duplicate values in the column a.

I think irrespective of that we can trim t1.c, t1.d, as we have a primary key (unique and not null) on (t1.a, t1.b). Basically, even if we don't use the primary key index, we can still trim the keys in such a case. IIUC, Tom has mentioned the same in his message related to this issue. I am referring to the text below:

If we have ORDER BY a, b, c and (a,b) is the primary key, then including c in the ORDER BY list is semantically redundant, *whether or not we use an indexscan on the pkey index at all*. http://www.postgresql.org/message-id/5212.1397599...@sss.pgh.pa.us

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
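The semantic argument — that sort keys after a unique prefix are redundant — can be sketched independently of the planner. This is my own illustration, not pgbench or planner code: `trim_sort_keys` is a made-up helper, and it assumes the candidate keys are unique *and* not null (a primary key), since NULLs would break the total-ordering argument.

```python
def trim_sort_keys(sort_keys, unique_keys):
    """Drop trailing ORDER BY columns that are redundant because some
    leading prefix of the sort keys covers a unique, not-null key:
    once rows are totally ordered by such a key, further sort keys
    cannot change the order."""
    for i in range(1, len(sort_keys) + 1):
        prefix = set(sort_keys[:i])
        if any(set(uk) <= prefix for uk in unique_keys):
            return sort_keys[:i]    # this prefix already orders rows totally
    return sort_keys                # no unique key covered; trim nothing

# ORDER BY a, b, c, d with primary key (a, b): c and d are redundant
print(trim_sort_keys(["a", "b", "c", "d"], [("a", "b")]))   # ['a', 'b']

# with the pkey dropped there is no unique prefix, so nothing is trimmed
print(trim_sort_keys(["a", "b", "c", "d"], []))             # ['a', 'b', 'c', 'd']
```

The two calls mirror the thread's two cases: with the (a,b) key present the ORDER BY collapses to its unique prefix, and without it the full key list must stay.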
Re: [HACKERS] Production block comparison facility
On Tue, Jul 29, 2014 at 7:30 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: I don't understand how this works. A full-page image contains the new page contents *after* the WAL-logged operation. For example, in a heap insert, the full-page image contains the new tuple. How can you compare that with what's on the disk already? An exact match of the FPW and the current page is not done, the patch as it stands now checks if a FPW is consistent with the content of current page by checking if it does not include changes that diverge from what the FPW has. For example for a heap insert, if current page has N records pointer1/tup1..pointerN/tupN, FPW should only contain (N+1) records pointer1/tup1..pointer(N+1)/tup(N+1). After applying the mask at block recovery, process simply checks that the FPW and current page contain the first N records, marking FPW and current page as inconsistent if the current page has some garbage like some extra tuple entries not in the FPW. I am sure you have arguments against that though... ISTM you'd need to log two full-page images for every WAL record. A before image and an after image. The after image is the current FPW, so there is nothing else to do for it. But for the before buffer, what do you think about using ReadBufferExtended with RBM_NORMAL? We could grab its content from disk in XLogInsert only when we are sure that a backup block is added. Then you could do a lot of checking: 1. the before image should match what's on disk already 2. the result after applying the WAL record should match the after image. A WAL record can contain up to XLR_MAX_BKP_BLOCKS backup blocks. Should we double it from 4 to 8? That would be more handy than the approach I used, where the page images are logged to a separate file. You wouldn't need to deal with any new files, as all the data is in the WAL. Verification would be done directly in the standby, with no need to run any extra programs. 
In this case, would it be better to control that with a GUC? Making that the default would increase the amount of WAL for all types of applications, except if coupled with FPW compression...

Regards, -- Michael
Re: [HACKERS] pgaudit - an auditing extension for PostgreSQL
On Thu, Jun 26, 2014 at 09:59:59AM -0400, Stephen Frost wrote: Simon, * Simon Riggs (si...@2ndquadrant.com) wrote: Which tables are audited would be available via the reloptions field. RLS could be implemented through reloptions too. Would it be useful to some people? Likely. Would it satisfy the users who, today, are actually asking for that feature? No (or at least, not the ones that I've talked with). We could expand quite a few things to work through reloptions but I don't see it as a particularly good design for complex subsystems, of which auditing is absolutely one. I saw many mentions of pg_upgrade in this old thread. I think the focus should be on pg_dump, which is where the SQL syntax or custom reloptions would be dumped and restored; pg_upgrade will just use that functionality. In summary, I don't think there is anything pg_upgrade-specific here, but rather the issue of how this information will be dumped and restored, regardless of whether a major upgrade is taking place. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
Re: [HACKERS] pgaudit - an auditing extension for PostgreSQL
On Tue, Jul 29, 2014 at 09:08:38AM -0400, Bruce Momjian wrote: On Thu, Jun 26, 2014 at 09:59:59AM -0400, Stephen Frost wrote: Simon, * Simon Riggs (si...@2ndquadrant.com) wrote: Which tables are audited would be available via the reloptions field. RLS could be implemented through reloptions too. Would it be useful to some people? Likely. Would it satisfy the users who, today, are actually asking for that feature? No (or at least, not the ones that I've talked with). We could expand quite a few things to work through reloptions but I don't see it as a particularly good design for complex subsystems, of which auditing is absolutely one of those. I saw many mentions of pg_upgrade in this old thread. I think the focus should be in pg_dump, which is where the SQL syntax or custom reloptions would be dumped and restored. pg_upgrade will just use that functionality. In summary, I don't think there is anything pg_upgrade-specific here, but rather the issue of how this information will be dumped and restored, regardless of whether a major upgrade is taking place. Actually, thinking more, Stephen Frost mentioned that the auditing system has to modify database _state_, and dumping/restoring the state of an extension might be tricky. -- Bruce Momjian br...@momjian.ushttp://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] New developer TODO suggestions
On Tue, Jun 24, 2014 at 10:58:54AM +0800, Craig Ringer wrote:
Hi all
Someone recently mentioned that there's no generate_series(numeric, numeric, numeric). That strikes me as a great candidate for a new-developer-learning-PostgreSQL TODO. A couple of other things I occasionally run into that'd fit the bill:
* A user-level elog(...) / ereport(...) function callable from SQL. Useful in CASE statements.
* A log_ option to log whenever pg switches to a new xlog segment.
The above seem good.
* A 'hex' option to 'decode' that decodes regular hex into bytea, or an equivalent decode_hex / hex_decode. That's for plain undecorated hex, not \x literals.
* A corresponding encode_hex or hex_encode to emit hex 'text' without a \x prefix (not a bytea literal). (Yes, I know you can form bytea literals with concatenation and decode that way, and can strip the \x prefix from a literal on output, but it's often pretty awkward.)
Uh, don't our SQL string functions allow removal of \x?
* A user-accessible function to decode unicode escapes like \U1011 in strings.
Makes sense.
* A function that converts a json array to a PostgreSQL array of a given type if all json members are compatible with the type.
* Expanding the set of json/jsonb operations to introduce features that people are used to from jquery, mongo, etc.: replace-key-if-exists-without-adding, add-or-replace-key, etc.
* (Not really Pg proper, but enough users run into this that I think we should encourage interested people to tackle it): in PgAdmin-III either support \copy, \c, etc. or detect their use and emit an informative error telling the user to use 'psql'.
I think you have to ask Andrew on these.
* When a user tries to run psql -f some_custom_format_backup, detect this and emit a useful error message. Ditto stdin.
Uh, good idea, but can we really do that in psql?
* Add a built-in aggregate for array_agg(anyarray), i.e. build an array of dims n+1 from the input arrays of dims n. 
For n=1 this can be done with a simple SQL-level aggregate definition, so all it really needs is to error on dims > 1, IMO. * Add a built-in aggregate form of array_cat ... probably other things I'm forgetting. No idea on these. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
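The decode_hex idea above is easy to prototype outside the server. A sketch of the plain, undecorated-hex decoding behaviour such a function would need; the names and signature here are invented for illustration, not an existing API:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Map one hex digit to its value, or -1 for a non-hex character. */
static int
hex_nibble(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/*
 * Decode plain hex (no \x prefix) into bytes.  Rejects odd-length input
 * and non-hex characters, which is the error checking a user-facing
 * decode_hex()/hex_decode() would have to report.
 */
static bool
decode_hex(const char *in, size_t inlen, unsigned char *out, size_t *outlen)
{
    if (inlen % 2 != 0)
        return false;
    for (size_t i = 0; i < inlen; i += 2)
    {
        int hi = hex_nibble(in[i]);
        int lo = hex_nibble(in[i + 1]);

        if (hi < 0 || lo < 0)
            return false;
        out[i / 2] = (unsigned char) (hi << 4 | lo);
    }
    *outlen = inlen / 2;
    return true;
}

/* Demo helper: first decoded byte of a hex string, or 0 on failure. */
static unsigned char
demo_first_byte(const char *hex)
{
    unsigned char buf[64];
    size_t      n = 0;

    return decode_hex(hex, strlen(hex), buf, &n) ? buf[0] : 0;
}
```

The inverse (encode_hex) is the same table walked in the other direction.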
Re: [HACKERS] Proposal: Incremental Backup
On 25/07/14 16:15, Michael Paquier wrote: On Fri, Jul 25, 2014 at 10:14 PM, Marco Nenciarini marco.nenciar...@2ndquadrant.it wrote: 0. Introduction: = This is a proposal for adding incremental backup support to the streaming protocol and hence to the pg_basebackup command. Not sure that incremental is the right word, as the existing backup methods using WAL archives are already like that. I recall others calling that differential backup in some previous threads. Would that sound better? differential backup is widely used to refer to a backup that is always based on a full backup. An incremental backup can be based either on a full backup or on a previous incremental backup. We picked that name to emphasize this property. 1. Proposal = Our proposal is to introduce the concept of a backup profile. Sounds good. Thanks for looking at that. The backup profile consists of a file with one line per file detailing tablespace, path, modification time, size and checksum. Using that file the BASE_BACKUP command can decide which files need to be sent again and which are unchanged. The algorithm should be very similar to rsync, but since our files are never bigger than 1 GB each, that is probably granular enough not to worry about copying parts of files, just whole files. There are actually two levels of differential backup: file-level, which is the approach you are taking, and block-level. A block-level backup requires a scan of all the blocks of all the relations, taking only the data from the blocks newer than the LSN given by the BASE_BACKUP command. In the case of the file-level approach, you could already back up the relation file after finding at least one modified block. I like the idea of shortcutting the checksum when you find a block with an LSN newer than the previous backup's START WAL LOCATION; however, I see it as a further optimization. 
In any case, it is worth storing the backup start LSN in the header section of the backup_profile together with other useful information about the backup starting position. As a first step we would have a simple and robust method to produce a file-level incremental backup. Btw, the size of relation files depends on the size defined by --with-segsize when running configure. 1GB is the default though, and the value usually used. Differential backups can reduce the size of overall backups depending on the application, at the cost of some CPU to analyze the relation blocks that need to be included in the backup. We tested the idea on several multi-terabyte installations using a custom deduplication script which follows this approach. The result is that it can reduce the backup size by more than 50%. Also, most databases in the range 50GB - 1TB can take great advantage of it. It could also be used in 'refresh' mode, by allowing the pg_basebackup command to 'refresh' an old backup directory with a new backup. I am not sure this is really helpful... Could you please elaborate on the last sentence? The final piece of this architecture is a new program called pg_restorebackup which is able to operate on a chain of incremental backups, allowing the user to build a usable PGDATA from them or to execute maintenance operations such as verifying the checksums or estimating the final size of the recovered PGDATA. Yes, right. Taking a differential backup is not difficult, but rebuilding a consistent base backup from a full base backup and a set of differential ones is the tricky part: you need to be sure that all the pieces of the puzzle are there. If we limit it to be file-based, the recovery procedure is conceptually simple: read every involved manifest from the start and take the latest available version of any file (or mark it for deletion, if the last time it is named is in a backup_exceptions file). Keeping the algorithm as simple as possible is in our opinion the best way to go. 
We created a wiki page with all implementation details at https://wiki.postgresql.org/wiki/Incremental_backup I had a look at that, and I think that you are missing the point about the way differential backups should be taken. What would be necessary is to pass a WAL position (or LSN, log sequence number, like 0/260) with a new clause called DIFFERENTIAL (INCREMENTAL in your first proposal) in the BASE BACKUP command, and then have the server report back to the client all the files that contain blocks newer than the given LSN position for a file-level backup, or the blocks newer than the given LSN for a block-level differential backup. In our proposal a file is skipped if, and only if, it has the same size, the same mtime and *the same checksum* as the original file. We intentionally want to keep it simple, also easily supporting files that are stored in $PGDATA but don't follow any format known by Postgres. However, even with more complex algorithms, all the
Re: [HACKERS] Proposal: Incremental Backup
On 25/07/14 20:21, Claudio Freire wrote: On Fri, Jul 25, 2014 at 10:14 AM, Marco Nenciarini marco.nenciar...@2ndquadrant.it wrote: 1. Proposal = Our proposal is to introduce the concept of a backup profile. The backup profile consists of a file with one line per file detailing tablespace, path, modification time, size and checksum. Using that file the BASE_BACKUP command can decide which files need to be sent again and which are unchanged. The algorithm should be very similar to rsync, but since our files are never bigger than 1 GB each, that is probably granular enough not to worry about copying parts of files, just whole files. That wouldn't be nearly as useful as the LSN-based approach mentioned before. I've had my share of rsyncing live databases (when resizing filesystems, not for backup, but the anecdotal evidence applies anyhow) and with moderately write-heavy databases, even if you only modify a tiny portion of the records, you end up modifying a huge portion of the segments, because the free space choice is random. There have been patches going around to change the random nature of that choice, but none are very likely to make a huge difference for this application. In essence, file-level comparisons get you only a mild speed-up, and are not worth the effort. I'd go for the hybrid file+LSN method, or nothing. The hybrid avoids the I/O of inspecting the LSN of entire segments (a necessary optimization for huge multi-TB databases) and backs up only the portions modified when segments do contain changes, so it's the best of both worlds. Any partial implementation would either require lots of I/O (LSN only) or save very little (file only) unless it's an almost read-only database. From my experience, if a database is big enough and there is any kind of historical data in it, the file-only approach works well. Moreover, it has the advantage of being simple and easily verifiable. 
Regards, Marco -- Marco Nenciarini - 2ndQuadrant Italy PostgreSQL Training, Services and Support marco.nenciar...@2ndquadrant.it | www.2ndQuadrant.it
Re: [HACKERS] Proposal: Incremental Backup
On Tue, Jul 29, 2014 at 1:24 PM, Marco Nenciarini marco.nenciar...@2ndquadrant.it wrote: On Fri, Jul 25, 2014 at 10:14 AM, Marco Nenciarini marco.nenciar...@2ndquadrant.it wrote: 1. Proposal = Our proposal is to introduce the concept of a backup profile. The backup profile consists of a file with one line per file detailing tablespace, path, modification time, size and checksum. Using that file the BASE_BACKUP command can decide which file needs to be sent again and which is not changed. The algorithm should be very similar to rsync, but since our files are never bigger than 1 GB per file that is probably granular enough not to worry about copying parts of files, just whole files. That wouldn't nearly as useful as the LSN-based approach mentioned before. I've had my share of rsyncing live databases (when resizing filesystems, not for backup, but the anecdotal evidence applies anyhow) and with moderately write-heavy databases, even if you only modify a tiny portion of the records, you end up modifying a huge portion of the segments, because the free space choice is random. There have been patches going around to change the random nature of that choice, but none are very likely to make a huge difference for this application. In essence, file-level comparisons get you only a mild speed-up, and are not worth the effort. I'd go for the hybrid file+lsn method, or nothing. The hybrid avoids the I/O of inspecting the LSN of entire segments (necessary optimization for huge multi-TB databases) and backups only the portions modified when segments do contain changes, so it's the best of both worlds. Any partial implementation would either require lots of I/O (LSN only) or save very little (file only) unless it's an almost read-only database. From my experience, if a database is big enough and there is any kind of historical data in the database, the file only approach works well. Moreover it has the advantage of being simple and easily verifiable. 
I don't see how that would be true if it's not full of read-only or append-only tables. Furthermore, even in that case, you need to have the database locked while performing the file-level backup, and computing all the checksums means processing the whole thing. That's a huge amount of time to be locked for multi-TB databases, so how is that good enough?
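The block-level filtering Claudio favours boils down to a per-page LSN comparison against the start LSN of the reference backup. A rough sketch, assuming 8 kB pages whose header begins with an 8-byte LSN (in the server this is PageGetLSN() on the page header; the helper below is illustrative only and sidesteps the on-disk XLogRecPtr layout):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define DEMO_BLCKSZ 8192        /* illustrative page size */

/*
 * Decide whether a block must go into the differential backup: include it
 * iff its page LSN is newer than the LSN at which the reference backup
 * started.  Blocks untouched since then keep an older LSN and are skipped.
 */
static bool
block_changed_since(const unsigned char *page, uint64_t backup_start_lsn)
{
    uint64_t    page_lsn;

    /* the page LSN sits at the start of the page header */
    memcpy(&page_lsn, page, sizeof(page_lsn));
    return page_lsn > backup_start_lsn;
}

/* Demo helper: build a zeroed page carrying the given LSN and test it. */
static bool
demo_check(uint64_t page_lsn, uint64_t backup_start_lsn)
{
    unsigned char page[DEMO_BLCKSZ] = {0};

    memcpy(page, &page_lsn, sizeof(page_lsn));
    return block_changed_since(page, backup_start_lsn);
}
```

The hybrid approach discussed above would apply this test only within files that already failed a cheap file-level comparison, avoiding a full scan of untouched segments.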
Re: [HACKERS] Re: [GENERAL] pg_dump behaves differently for different archive formats
On Mon, Jul 28, 2014 at 10:55 AM, Tom Lane t...@sss.pgh.pa.us wrote: Stephen Frost sfr...@snowman.net writes: If we're going to change this, it seems to me that the only option would be to change the dump format... Just off-the-cuff, I'm wondering if we could actually not change the real 'format' but simply promote each ACL entry (and similar cases..) to top-level objects and declare that TOC entries should be single statements. I don't think we want even more TOC entries, but it would not be unreasonable to insist that the statement(s) within a TOC entry be subdivided somehow. Essentially the payload of a TOC entry becomes a list of strings rather than just one string. That would mean that the problem could not be fixed for existing archive files; but that seems OK, given the rather small number of complaints so far. If we had something like that, I'd be strongly inclined to get rid of the existing convention whereby comments and ACL commands are separate TOC entries, and make them part of the parent object's TOC entry (which'd mean we'd want to label the sub-strings so we can tell whether they are main object, comment, or ACL). The fewer TOC entries we can have, the better; there is no reason why comments/ACLs should be independently sortable. Maybe, but I think people will still want an option to skip restoring them altogether (at least for ACLs). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: Incremental Backup
On 25/07/14 20:44, Robert Haas wrote: On Fri, Jul 25, 2014 at 2:21 PM, Claudio Freire klaussfre...@gmail.com wrote: On Fri, Jul 25, 2014 at 10:14 AM, Marco Nenciarini marco.nenciar...@2ndquadrant.it wrote: 1. Proposal = Our proposal is to introduce the concept of a backup profile. The backup profile consists of a file with one line per file detailing tablespace, path, modification time, size and checksum. Using that file the BASE_BACKUP command can decide which files need to be sent again and which are unchanged. The algorithm should be very similar to rsync, but since our files are never bigger than 1 GB each, that is probably granular enough not to worry about copying parts of files, just whole files. That wouldn't be nearly as useful as the LSN-based approach mentioned before. I've had my share of rsyncing live databases (when resizing filesystems, not for backup, but the anecdotal evidence applies anyhow) and with moderately write-heavy databases, even if you only modify a tiny portion of the records, you end up modifying a huge portion of the segments, because the free space choice is random. There have been patches going around to change the random nature of that choice, but none are very likely to make a huge difference for this application. In essence, file-level comparisons get you only a mild speed-up, and are not worth the effort. I'd go for the hybrid file+LSN method, or nothing. The hybrid avoids the I/O of inspecting the LSN of entire segments (a necessary optimization for huge multi-TB databases) and backs up only the portions modified when segments do contain changes, so it's the best of both worlds. Any partial implementation would either require lots of I/O (LSN only) or save very little (file only) unless it's an almost read-only database. I agree with much of that. However, I'd question whether we can really seriously expect to rely on file modification times for critical data-integrity operations. 
I wouldn't like it if somebody ran ntpdate to fix the time while the base backup was running, and it set the time backward, and the next differential backup consequently omitted some blocks that had been modified during the base backup. Our proposal doesn't rely on file modification times for data integrity. We use the file mtime only as a fast indication that the file has changed, in which case we transfer it again without computing the checksum. If timestamp and size match, we rely on *checksums* to decide whether it has to be sent. In SMART MODE we would use the file mtime to skip the checksum check in some cases, but it wouldn't be the default operation mode and it would carry all the necessary warnings. However, SMART MODE isn't a core part of our proposal, and can be delayed until we agree on the safest way to bring it to the end user. Regards, Marco -- Marco Nenciarini - 2ndQuadrant Italy PostgreSQL Training, Services and Support marco.nenciar...@2ndquadrant.it | www.2ndQuadrant.it
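Marco's rule can be stated compactly: a size or mtime mismatch is a fast path that forces a resend, but a file is skipped only when the checksum also matches. A sketch with invented names (the real backup_profile format is described on the wiki page linked earlier):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical backup_profile entry; field names are illustrative only. */
typedef struct ProfileEntry
{
    int64_t     size;
    int64_t     mtime;
    char        checksum[33];   /* hex digest + NUL */
} ProfileEntry;

/*
 * Returns true when the file must be sent again.  The checksum is
 * consulted only when size and mtime both match, mirroring the fast path
 * described above: mtime alone is never trusted to *skip* a file.
 */
static bool
file_needs_resend(const ProfileEntry *prev, const ProfileEntry *cur)
{
    if (prev->size != cur->size || prev->mtime != cur->mtime)
        return true;            /* changed for sure; no checksum needed */
    return strcmp(prev->checksum, cur->checksum) != 0;
}

/* Demo entries (fabricated values, for illustration). */
static const ProfileEntry demo_prev = {8192, 1406620800, "0f343b09"};
static const ProfileEntry demo_same = {8192, 1406620800, "0f343b09"};
static const ProfileEntry demo_diff = {8192, 1406620800, "deadbeef"};
static const ProfileEntry demo_grew = {16384, 1406620900, "0f343b09"};
```

Note how a backward clock step cannot cause a silent omission here: an unchanged-looking mtime still leads to a checksum comparison, which is the property Marco defends against Robert's ntpdate scenario.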
Re: [HACKERS] Re: [GENERAL] pg_dump behaves differently for different archive formats
Robert Haas robertmh...@gmail.com writes: On Mon, Jul 28, 2014 at 10:55 AM, Tom Lane t...@sss.pgh.pa.us wrote: If we had something like that, I'd be strongly inclined to get rid of the existing convention whereby comments and ACL commands are separate TOC entries, and make them part of the parent object's TOC entry (which'd mean we'd want to label the sub-strings so we can tell whether they are main object, comment, or ACL). The fewer TOC entries we can have, the better; there is no reason why comments/ACLs should be independently sortable. Maybe, but I think people will still want an option to skip restoring them altogether (at least for ACLs). Sure; we already have such an option, and I'm not suggesting removing it. The implementation would change though: it would have to look at the individual command labels to see which part(s) of a TOC entry to print out. regards, tom lane
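Tom's labelled sub-string idea might look roughly like the following; the type and function names are invented for illustration and are not pg_dump's actual structures:

```c
#include <assert.h>
#include <stdbool.h>

/* Label telling what role a sub-string plays within one TOC entry. */
typedef enum TocPartKind
{
    TOC_MAIN,                   /* the object's own DDL */
    TOC_COMMENT,                /* COMMENT ON ... */
    TOC_ACL                     /* GRANT/REVOKE ... */
} TocPartKind;

typedef struct TocPart
{
    TocPartKind kind;
    const char *sql;
} TocPart;

/*
 * Restore-time filtering: with labelled parts, skipping ACLs no longer
 * means skipping whole TOC entries -- just the parts carrying that label.
 */
static int
count_restorable(const TocPart *parts, int n, bool include_acls)
{
    int         k = 0;

    for (int i = 0; i < n; i++)
        if (include_acls || parts[i].kind != TOC_ACL)
            k++;
    return k;
}

/* Demo: one TOC entry whose payload is a list of labelled statements. */
static const TocPart demo_parts[3] = {
    {TOC_MAIN, "CREATE TABLE t (a int);"},
    {TOC_COMMENT, "COMMENT ON TABLE t IS 'demo';"},
    {TOC_ACL, "GRANT SELECT ON t TO PUBLIC;"},
};
```

This is exactly the shape Tom describes: the entry's payload becomes a list of strings, and options like skipping ACLs inspect the labels rather than the TOC topology.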
Re: [HACKERS] pg_background (and more parallelism infrastructure patches)
On Mon, Jul 28, 2014 at 1:50 PM, Andres Freund and...@2ndquadrant.com wrote: Don't get me wrong, I don't object to anything in here. It's just that the bigger picture can help giving sensible feedback. Right. I did not get you wrong. :-) The reason I'm making a point of it is that, if somebody wants to object to the way those facilities are designed, it'd be good to get that out of the way now rather than waiting until 2 or 3 patch sets from now and then saying Uh, could you guys go back and rework all that stuff?. I'm not going to complain too loudly now if somebody wants something in there done in a different way, but it's easier to do that now while there's only pg_background sitting on top of it. What I'm thinking of is providing an actual API for the writes instead of hooking into the socket API in a couple places. I.e. have something like typedef struct DestIO DestIO; struct DestIO { void (*flush)(struct DestIO *io); int (*putbytes)(struct DestIO *io, const char *s, size_t len); int (*getbytes)(struct DestIO *io, const char *s, size_t len); ... } and do everything through it. I haven't thought much about the specific API we want, but abstracting the communication properly instead of adding hooks here and there is imo much more likely to succeed in the long term. This sounds suspiciously like the DestReceiver thing we've already got, except that the DestReceiver only applies to tuple results, not errors and notices and so on. I'm not totally unamenable to a bigger refactoring here, but right now it looks to me like a solution in search of a problem. The hooks are simple and seem to work fine; I don't want to add notation for its own sake. Also, you seem to have only touched receiving from the client, and not sending back to the subprocess. Is that actually sufficient? I'd expect that for this facility to be fully useful it'd have to be two way communication. But perhaps I'm overestimating what it could be used for. 
Well, the basic shm_mq infrastructure can be used to send any kind of messages you want between any pair of processes that care to establish them. But in general I expect that data is going to flow mostly in one direction - the user backend will launch workers and give them an initial set of instructions, and then results will stream back from the workers to the user backend. Other messaging topologies are certainly possible, and probably useful for something, but I don't really know exactly what those things will be yet, and I'm not sure the FEBE protocol will be the right tool for the job anyway. It's imo not particularly unreasonable to e.g. COPY to/from a bgworker. Which would require the ability to both read/write from the other side. Well, that should work fine if the background worker and user backend generate the CopyData messages via some bespoke code rather than expecting to be able to jump into copy.c and have everything work. If you want that to work, why? It doesn't make much sense for pg_background, because I don't think it would be sensible for SELECT pg_background_result(...) to return CopyInResponse or CopyOutResponse, and even if it were sensible it doesn't seem useful. And I can't think of any other application off-hand, either. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Making joins involving ctid work for the benefit of UPSERT
On Mon, Jul 28, 2014 at 1:43 PM, Peter Geoghegan p...@heroku.com wrote: On Mon, Jul 28, 2014 at 8:37 AM, Robert Haas robertmh...@gmail.com wrote: AFAIUI, this is because your implementation uses lwlocks in a way that Andres and I both find unacceptable. That's not the case. My implementation uses page-level heavyweight locks. The nbtree AM used to use them for other stuff. Plenty of other systems use index level locks managed by a heavyweight lock manager. Oh, OK. I think I missed that development somewhere. Good point. Maybe the syntax should be something like: UPSERT table (keycol [, keycol] ...) { VALUES (val [, val] ...) [, ...] | select_query } That would address both the concern about being able to pipe multiple tuples through it and the point you just raised. We look for a row that matches each given tuple on the key columns; if one is found, we update it; if none is found, we insert. That basically is my design, except that (tangentially) yours risks bloat in exchange for not having to use a real locking mechanism, and has a different syntax. I think it would be advisable to separate the syntax from the implementation. Presumably you can make your implementation use some reasonable syntax we can all agree on, and conversely my proposed syntax could be made to have a different set of semantics. There's some connection between the syntax and semantics, of course, but it's not 100%. I mention this because I was mostly concerned with getting to a reasonable syntax proposal, not so much the implementation details. It may well be that your implementation details are perfect at this point; I don't know because I haven't looked, and I'm not an expert on that area of the code anyway. But I have looked at your syntax, which I wasn't altogether keen on. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Making joins involving ctid work for the benefit of UPSERT
On Tue, Jul 29, 2014 at 9:56 AM, Robert Haas robertmh...@gmail.com wrote: I think it would be advisable to separate the syntax from the implementation. Presumably you can make your implementation use some reasonable syntax we can all agree on, and conversely my proposed syntax could be made to have a different set of semantics. There's some connection between the syntax and semantics, of course, but it's not 100%. I mention this because I was mostly concerned with getting to a reasonable syntax proposal, not so much the implementation details. It may well be that your implementation details are perfect at this point; I don't know because I haven't looked, and I'm not an expert on that area of the code anyway. But I have looked at your syntax, which I wasn't altogether keen on. Fair enough. I think the syntax should reflect the fact that upserts are driven by inserts, though. Users will get into trouble with a syntax that allows a predicate that is evaluated before any rows are locked. -- Peter Geoghegan
Re: [HACKERS] Performance issue in pg_dump's dependency loop searching
On 25 July 2014 20:47, Tom Lane t...@sss.pgh.pa.us wrote: Another idea would be to ...persist the optimal dump order in the database. That way we can maintain the correct dump order each time we do DDL, which is only a small incremental cost, no matter how many objects we have. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Performance issue in pg_dump's dependency loop searching
Simon Riggs si...@2ndquadrant.com writes: On 25 July 2014 20:47, Tom Lane t...@sss.pgh.pa.us wrote: Another idea would be to ...persist the optimal dump order in the database. That way we can maintain the correct dump order each time we do DDL, which is only a small incremental cost, no matter how many objects we have. I don't see any obvious way to make it incremental; so I doubt that it would be a small extra cost. In any case I disagree that making DDL slower to make pg_dump faster is a good tradeoff. Many people seldom or never use pg_dump. regards, tom lane
Re: [HACKERS] Performance issue in pg_dump's dependency loop searching
On Tue, Jul 29, 2014 at 3:06 PM, Tom Lane t...@sss.pgh.pa.us wrote: Simon Riggs si...@2ndquadrant.com writes: On 25 July 2014 20:47, Tom Lane t...@sss.pgh.pa.us wrote: Another idea would be to ...persist the optimal dump order in the database. That way we can maintain the correct dump order each time we do DDL, which is only a small incremental cost, no matter how many objects we have. I don't see any obvious way to make it incremental; so I doubt that it would be a small extra cost. In any case I disagree that making DDL slower to make pg_dump faster is a good tradeoff. Many people seldom or never use pg_dump. regards, tom lane Not to mention slowing down temp tables.
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Sun, Jul 27, 2014 at 12:00 AM, Peter Geoghegan p...@heroku.com wrote: I attach a new revision. I think that I may have missed a trick here. It turns out that it isn't expensive to also hash original text values, to track their cardinality using HyperLogLog - it's hardly measurable when done at an opportune point just after strxfrm() expansion. I was already tracking the cardinality of poor man's normalized keys using HyperLogLog. If I track the cardinality of both sets (original values and normalized keys), I can compare the two when evaluating if normalization should be aborted. This is by far the most important consideration. This causes the optimization to be applied when sorting millions of tuples with only a tiny number of distinct values (like, 4 or 5), without making bad cases that we fail to abort in a timely manner any more likely. This is still significantly profitable - over 90% faster in one test, because the optimistic memcmp() still allows us to avoid any strcoll() calls. It looks about the same as using the C collation. Not quite the huge boost we can see, but still well worthwhile. In general it seems well principled to have the should I abort normalization? algorithm mostly weigh how effective a proxy for full key cardinality normalized key cardinality is. If it is a good proxy then nothing else matters. If it's not a very good proxy, that can only be because there are many differences beyond the first 8 bytes. Only then will we weigh the total number of distinct normalized keys, and as the effectiveness of normalized key cardinality as a proxy for overall cardinality falls, our requirements for the overall number of distinct normalized keys shoots up rapidly. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
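The abort heuristic Peter describes can be sketched as a function of the two HyperLogLog estimates. The thresholds below are invented purely for illustration (the real patch derives its cutoffs differently); the point is the shape of the decision: when normalized-key cardinality is a good proxy for full-key cardinality, nothing else matters, and as the proxy degrades, the required number of distinct prefixes shoots up.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Illustrative abort decision for poor man's normalized keys.
 * distinct_norm_keys and distinct_full_keys are HyperLogLog cardinality
 * estimates of the 8-byte prefixes and of the original values; the 0.5
 * and 0.05 cutoffs are hypothetical.
 */
static bool
should_abort_normalization(double distinct_norm_keys,
                           double distinct_full_keys,
                           double tuples_seen)
{
    /* How well do the first 8 bytes discriminate between distinct keys? */
    double      proxy_quality = distinct_norm_keys / distinct_full_keys;

    /* A good proxy: keep the optimization regardless of anything else. */
    if (proxy_quality > 0.5)
        return false;

    /*
     * A poor proxy (many differences lie beyond the first 8 bytes): only
     * continue if the prefixes are still numerous enough relative to the
     * input to save many strcoll() calls via the memcmp() fast path.
     */
    return distinct_norm_keys / tuples_seen < 0.05;
}
```

This also captures the few-distinct-values case from the test above: with 4 or 5 distinct values, prefix cardinality tracks full cardinality almost perfectly, so normalization is never aborted even though absolute cardinality is tiny.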
[HACKERS] multixact optimization patches
I have just pushed two optimization patches for multixacts to HEAD only. I hesitate to backpatch them to 9.3 right away, but will consider doing so if people think differently. I also have a patch that supposedly fixes the performance regression reported in bug #8470, but it's considerably more obscure so I'm not pushing it right now. It's attached. I'd need to spend some more time with it to ensure it doesn't break other cases before pushing. I know some people are interested in this fix and thought I'd throw it out there to gather some input. Since it affects much of the same code as the two patches I just pushed, using it in 9.3 requires cherry-picking those patches, but they apply without conflicts so it should be easy.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

*** a/contrib/pgrowlocks/pgrowlocks.c
--- b/contrib/pgrowlocks/pgrowlocks.c
***************
*** 136,149 **** pgrowlocks(PG_FUNCTION_ARGS)
  		infomask = tuple->t_data->t_infomask;
  
  		/*
! 		 * a tuple is locked if HTSU returns BeingUpdated, and if it returns
! 		 * MayBeUpdated but the Xmax is valid and pointing at us.
  		 */
! 		if (htsu == HeapTupleBeingUpdated ||
! 			(htsu == HeapTupleMayBeUpdated &&
! 			 !(infomask & HEAP_XMAX_INVALID) &&
! 			 !(infomask & HEAP_XMAX_IS_MULTI) &&
! 			 (xmax == GetCurrentTransactionIdIfAny())))
  		{
  			char	  **values;
--- 136,144 ----
  		infomask = tuple->t_data->t_infomask;
  
  		/*
! 		 * A tuple is locked if HTSU returns BeingUpdated.
  		 */
! 		if (htsu == HeapTupleBeingUpdated)
  		{
  			char	  **values;

*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c
***************
*** 2711,2791 **** l1:
  	}
  	else if (result == HeapTupleBeingUpdated && wait)
  	{
- 		TransactionId xwait;
- 		uint16		infomask;
- 
- 		/* must copy state data before unlocking buffer */
- 		xwait = HeapTupleHeaderGetRawXmax(tp.t_data);
- 		infomask = tp.t_data->t_infomask;
- 
- 		LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
- 
  		/*
! 		 * Acquire tuple lock to establish our priority for the tuple (see
! 		 * heap_lock_tuple).  LockTuple will release us when we are
! 		 * next-in-line for the tuple.
  		 *
! 		 * If we are forced to start over below, we keep the tuple lock;
! 		 * this arranges that we stay at the head of the line while rechecking
! 		 * tuple state.
  		 */
! 		if (!have_tuple_lock)
  		{
! 			LockTupleTuplock(relation, &(tp.t_self), LockTupleExclusive);
! 			have_tuple_lock = true;
  		}
! 
! 		/*
! 		 * Sleep until concurrent transaction ends.  Note that we don't care
! 		 * which lock mode the locker has, because we need the strongest one.
! 		 */
! 		if (infomask & HEAP_XMAX_IS_MULTI)
! 		{
! 			/* wait for multixact */
! 			MultiXactIdWait((MultiXactId) xwait, MultiXactStatusUpdate, infomask,
! 							relation, &tp.t_data->t_ctid, XLTW_Delete,
! 							NULL);
! 			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 
  			/*
! 			 * If xwait had just locked the tuple then some other xact could
! 			 * update this tuple before we get to this point.  Check for xmax
! 			 * change, and start over if so.
  			 */
! 			if (xmax_infomask_changed(tp.t_data->t_infomask, infomask) ||
! 				!TransactionIdEquals(HeapTupleHeaderGetRawXmax(tp.t_data),
! 									 xwait))
! 				goto l1;
 
  			/*
! 			 * You might think the multixact is necessarily done here, but not
! 			 * so: it could have surviving members, namely our own xact or
! 			 * other subxacts of this backend.  It is legal for us to delete
! 			 * the tuple in either case, however (the latter case is
! 			 * essentially a situation of upgrading our former shared lock to
! 			 * exclusive).  We don't bother changing the on-disk hint bits
! 			 * since we are about to overwrite the xmax altogether.
  			 */
- 		}
- 		else
- 		{
- 			/* wait for regular transaction to end */
- 			XactLockTableWait(xwait, relation, &tp.t_data->t_ctid, XLTW_Delete);
- 			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
! 
! 			/*
! 			 * xwait is done, but if xwait had just locked the tuple then some
! 			 * other xact could update this tuple before we get to this point.
! 			 * Check for xmax change, and start over if so.
! 			 */
! 			if (xmax_infomask_changed(tp.t_data->t_infomask, infomask) ||
! 				!TransactionIdEquals(HeapTupleHeaderGetRawXmax(tp.t_data),
! 									 xwait))
! 				goto l1;
! 
  			/* Otherwise check if it committed or aborted */
! 			UpdateXmaxHintBits(tp.t_data, buffer, xwait);
  		}
 
  		/*
--- 2711,2814 ----
  	}
  	else if (result == HeapTupleBeingUpdated && wait)
  	{
  		/*
! 		 * Somebody is holding a lock on the tuple, or updating it; we now
! 		 * need to sleep on that transaction before we can proceed.  However,
! 		 * if the only locker is our own transaction (or any subtransaction of
! 		 * the current top transaction), then it's not necessary to do so.
! 		 * Note we only check for this case when the locker is a single xid,
! 		 * because MultiXactIdWait
Re: [HACKERS] Reminder: time to stand down from 8.4 maintenance
* Tom Lane (t...@sss.pgh.pa.us) wrote: PG 8.4.x is EOL as of last week's releases, so it's time to remove that branch from any auto-update scripts you might have, reconfigure buildfarm members that are force-building it, etc. I've removed it from the Coverity weekly runs. Thanks, Stephen
Re: [HACKERS] plpgsql.consistent_into
On 1/14/14, 6:15 PM, Tom Lane wrote: We don't actually implement this in PG yet, except for trivial cases, but it will certainly happen eventually. I think your sketch above deviates unnecessarily from what the standard says for UPDATE. In particular I think it'd be better to write things like (a, b) = ROW(1, 2); (a, b, c) = (SELECT x, y, z FROM foo WHERE id = 42); which would exactly match what you'd write in a multiple-assignment UPDATE, and it has the same rejects-multiple-rows semantics too. Just in case someone's interested: I won't be working on this for 9.5. If someone feels like picking this patch up, be my guest. .marko
Re: [HACKERS] Proposal to add a QNX 6.5 port to PostgreSQL
Robert Haas robertmh...@gmail.com writes: On Fri, Jul 25, 2014 at 6:29 PM, Tom Lane t...@sss.pgh.pa.us wrote: This isn't really acceptable for production usage; if it were, we'd have done it already. The POSIX APIs lack any way to tell how many processes are attached to a shmem segment, which is *necessary* functionality for us (it's a critical part of the interlock against starting multiple postmasters in one data directory). I think it would be good to spend some energy figuring out what to do about this. Well, we've been around on this multiple times before, but if we have any new ideas, sure ... In our last discussion on this topic, we talked about using file locks as a substitute for nattch. You concluded that fcntl was totally broken for this purpose because of the possibility of some other piece of code accidentally opening and closing the lock file.[2] lockf appears to have the same problem, but flock might not, at least on some systems. My Linux man page for flock says flock() does not lock files over NFS. Use fcntl(2) instead: that does work over NFS, given a sufficiently recent version of Linux and a server which supports locking. which seems like a showstopper problem; we might try to tell people not to put their databases on NFS, but they're not gonna listen. It also says flock() and fcntl(2) locks have different semantics with respect to forked processes and dup(2). On systems that implement flock() using fcntl(2), the semantics of flock() will be different from those described in this manual page. which is pretty scary if it's accurate for any still-extant platforms; we might think we're using flock and still get fcntl behavior. It's also of concern that (AFAICS) flock is not in POSIX, which means we can't even expect that platforms will agree on how it *should* behave. 
I also noted that flock does not support atomic downgrade of exclusive lock to shared lock, which seems like a problem for the lock inheritance scheme sketched in http://www.postgresql.org/message-id/18162.1340761...@sss.pgh.pa.us ... but OTOH, it sounds like flock locks are not only inherited through fork() but even preserved across exec(), which would mean that we don't need that scheme for file lock inheritance, even with EXEC_BACKEND. Still, it's not clear to me how we could put much faith in flock. Finally, how about named pipes? Linux says that trying to open a named pipe for write when there are no readers will return ENXIO, and attempting to write to an already-open pipe with no remaining readers will cause SIGPIPE. So: create a permanent named pipe in the data directory that all PostgreSQL processes keep open. When the postmaster starts, it opens the pipe for read, then for write, then closes it for read. It then tries to write to the pipe. If this fails to result in SIGPIPE, then somebody else has got the thing open; so the new postmaster should die at once. But if it does get a SIGPIPE then there are as of that moment no other readers. Hm. That particular protocol is broken: two postmasters doing it at the same time would both pass (because neither has it open for read at the instant where they try to write). But we could possibly frob the idea until it works. Bigger question is how portable is this behavior? I see named pipes (fifos) in SUS v2, which is our usual baseline assumption about what's portable across Unixen, so maybe it would work. But does NFS support named pipes? regards, tom lane
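The named-pipe probe sketched above looks like this in code. This is only an illustration of the proposed protocol, not anything in PostgreSQL; the path handling is hypothetical, and as Tom notes the protocol as stated has a race (two processes running it simultaneously can both pass). Python ignores SIGPIPE by default, so the no-readers write surfaces as EPIPE / BrokenPipeError:

```python
import os

def claim_data_directory(fifo):
    """Probe a data-directory named pipe for other live readers."""
    # Create the fifo once; on restart it already exists, which is fine.
    try:
        os.mkfifo(fifo)
    except FileExistsError:
        pass
    # Open for read first (non-blocking, so no writer is needed yet) ...
    rd = os.open(fifo, os.O_RDONLY | os.O_NONBLOCK)
    # ... then for write, which succeeds because we ourselves are a reader.
    wr = os.open(fifo, os.O_WRONLY | os.O_NONBLOCK)
    # Drop our read side: any remaining reader is another postmaster
    # (or a child process that inherited the open descriptor).
    os.close(rd)
    try:
        os.write(wr, b"x")        # probe: does anyone else hold the pipe?
    except BrokenPipeError:
        # EPIPE: no other readers at this instant; we may proceed.  A real
        # implementation would now reopen the read side and hold it open.
        return wr
    os.close(wr)
    raise RuntimeError("data directory appears to be in use")
```

A second caller fails while another process keeps the read end open, which is the nattch-like property the thread is after.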
Re: [HACKERS] Proposal to add a QNX 6.5 port to PostgreSQL
Thank you to all who have responded to this proposal. PostgreSQL manages to meet all production requirements on Windows without System V shared memory, so I would think this can be achieved on QNX/Linux. The old PostgreSQL QNX port ran on the very old QNX4 (1991), so I understand why it would be of little value today. Currently, QNX Neutrino 6.5 is well established (and QNX 6.6 is emerging) and that is where a PostgreSQL port would be well received. I have attached my current work-in-progress patches for 9.3.4 and 9.4beta2 for the curious. To minimize risk, I have been careful to ensure my changes will affect only QNX builds; existing ports should see zero impact. To minimize the addition of new files, I have used the linux template rather than add qnx6 as a separate port/template. All regression tests pass on my system, so while not perfect it is at least a reasonable start. posix_shmem.c is still in need of some cleanup and mitigations to make it production-strength. If there are existing tests I can run to ensure the QNX port meets your criteria for robust failure handling in this area I would be happy to run them. If not, perhaps someone can provide a quick list of failure modes to consider. As-is: - starting of a second postmaster fails with message 'FATAL: lock file postmaster.pid already exists' - Kill -9 of postmaster followed by a pg_ctl start seems to go through recovery, although the original shared memory segments hang out in /dev/shmem until reboot (that could be better). Thanks again and please let me know if I can be of any assistance.
Keith Baker
Re: [HACKERS] Proposal to add a QNX 6.5 port to PostgreSQL
Baker, Keith [OCDUS Non-JJ] kbak...@its.jnj.com writes: If there are existing tests I can run to ensure the QNX port meets your criteria for robust failure handling in this area I would be happy to run them. If not, perhaps someone can provide a quick list of failure modes to consider. As-is: - starting of a second postmaster fails with message 'FATAL: lock file postmaster.pid already exists' - Kill -9 of postmaster followed by a pg_ctl start seems to go through recovery, although the original shared memory segments hang out in /dev/shmem until reboot (that could be better). Unfortunately, that probably proves it's broken rather than that it works. The behavior we need is that after kill -9'ing the postmaster, subsequent postmaster start attempts *fail* until all the original postmaster's child processes are gone. Otherwise you end up with two independent sets of processes scribbling on the same files (and not sharing shmem either). Kiss consistency goodbye ... It's possible that all the children automatically exited, especially if you had only background processes active; but if you had a live regular session it would not exit just because the parent process died. regards, tom lane
Re: [HACKERS] Proposal: Incremental Backup
On Wed, Jul 30, 2014 at 1:11 AM, Marco Nenciarini marco.nenciar...@2ndquadrant.it wrote: differential backup is widely used to refer to a backup that is always based on a full backup. An incremental backup can be based either on a full backup or on a previous incremental backup. We picked that name to emphasize this property. You can refer to this email: http://www.postgresql.org/message-id/cabuevexz-2nh6jxb5sjs_dss7qbmof0noypeeyaybkbufkp...@mail.gmail.com As a first step we would have a simple and robust method to produce a file-level incremental backup. An approach using Postgres internals, which we are sure we can rely on, is more robust. An LSN is similar to a timestamp in pg internals, as it refers to the point in time when a block was last modified. It could also be used in 'refresh' mode, by allowing the pg_basebackup command to 'refresh' an old backup directory with a new backup. I am not sure this is really helpful... Could you please elaborate on the last sentence? This overlaps with the features you are proposing with pg_restorebackup, where a backup is rebuilt. Why implement two interfaces for the same thing? -- Michael
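The LSN-based, file-level selection discussed above can be sketched as follows. This is a simplification for illustration, not the proposed patch: the function names are invented, and a real implementation would read each block's pd_lsn from the page header on disk rather than take a ready-made list:

```python
def file_needs_backup(page_lsns, last_backup_lsn):
    # A relation file must be re-copied if any of its pages was modified
    # at or after the LSN where the previous backup started.
    return any(lsn >= last_backup_lsn for lsn in page_lsns)

def plan_incremental_backup(files, last_backup_lsn):
    # files: mapping of relation file name -> iterable of per-page LSNs.
    # Returns the (sorted) subset of files the incremental backup must copy.
    return sorted(name for name, lsns in files.items()
                  if file_needs_backup(lsns, last_backup_lsn))
```

This is the sense in which an LSN plays the role of a per-block timestamp: the comparison against the previous backup's start LSN replaces a file-modification-time check, using a value WAL already guarantees to be reliable.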
Re: [HACKERS] ALTER SYSTEM RESET?
On Thu, Jun 26, 2014 at 8:17 PM, Vik Fearing vik.fear...@dalibo.com wrote: I didn't quite follow your ALTER TABLE example because I don't think it's necessary, I was asking to split the ALTER SYSTEM command like it's done for ALTER TABLE (AlterTableStmt: ALTER TABLE relation_expr alter_table_cmds). It would have made adding further commands to ALTER SYSTEM a bit simpler and more systematic. However, as there is no correctness issue here, let's leave it like you have currently done in the patch. I have verified the patch and found that it works well for all scenarios. A few minor suggestions: 1. ! values to the <filename>postgresql.auto.conf</filename> file. ! Setting the parameter to <literal>DEFAULT</literal>, or using the ! <command>RESET</command> variant, removes the configuration entry from It would be better if we can write a separate line for RESET ALL, as is done in the case of both Alter Database and Alter Role in their respective documentation. 2. ! %type <vsetstmt> generic_set set_rest set_rest_more generic_reset reset_rest SetResetClause FunctionSetResetClause Good to break it into 2 lines. 3. I think we can add some text on top of function AlterSystemSetConfigFile() to explain its functionality w.r.t. RESET ALL. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com