Re: [HACKERS] TODO : Allow parallel cores to be used by vacuumdb [ WIP ]
On Tue, Aug 19, 2014 at 4:27 PM, Amit Kapila amit.kapil...@gmail.com wrote: Few more comments: Some more comments: 1. I can see one shortcoming in the way the patch currently parallelizes the work for --analyze-in-stages. Basically, the patch performs the work for each stage for multiple tables in concurrent connections. That seems okay when the number of parallel connections is less than or equal to the number of tables, but when the user has asked for more connections than there are tables, I think this strategy will not be able to use the extra connections. 2. Similarly, for the case of multiple databases, it currently cannot use more connections than the number of tables in each database, because the parallelizing strategy is to use the concurrent connections only for tables inside a single database. I am not completely sure whether the current strategy is good enough or whether we should try to address the above problems. What do you think? 3. + do + { + i = select_loop(maxFd, &slotset); + Assert(i != 0); Could you explain the reason for using this loop? I think you want to wait for data on a socket descriptor, but why maxFd? It would also be better to explain this logic in a comment. 4. + for (i = 0; i < max_slot; i++) + { + if (!FD_ISSET(pSlot[i].sock, &slotset)) + continue; + + PQconsumeInput(pSlot[i].connection); + if (PQisBusy(pSlot[i].connection)) + continue; I think it is better to call PQconsumeInput() only if you find the connection is busy. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] pgbench throttling latency limit
Add --limit to limit latency under throttling Under throttling, transactions are scheduled for execution at certain times. Transactions may be far behind schedule and the system may catch up with the load later. This option allows to change this behavior by skipping transactions which are too far behind schedule, and count those as skipped. The idea is to help simulate a latency-constrained environment such as a database used by a web server. Find attached a new version: - fix dropped percent computation in the final report - simplify progress report code -- Fabien.diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c index 2f7d80e..37a4a8f 100644 --- a/contrib/pgbench/pgbench.c +++ b/contrib/pgbench/pgbench.c @@ -141,6 +141,13 @@ double sample_rate = 0.0; int64 throttle_delay = 0; /* + * When under throttling, execution time slots which are more than + * this late (in us) are skipped, and the corresponding transaction + * will be counted as somehow aborted. + */ +int64 throttle_latency_limit = 0; + +/* * tablespace selection */ char *tablespace = NULL; @@ -238,6 +245,7 @@ typedef struct int64 throttle_trigger; /* previous/next throttling (us) */ int64 throttle_lag; /* total transaction lag behind throttling */ int64 throttle_lag_max; /* max transaction lag */ + int64 throttle_latency_skipped; /* lagging transactions skipped */ } TState; #define INVALID_THREAD ((pthread_t) 0) @@ -250,6 +258,7 @@ typedef struct int64 sqlats; int64 throttle_lag; int64 throttle_lag_max; + int64 throttle_latency_skipped; } TResult; /* @@ -367,6 +376,8 @@ usage(void) -f, --file=FILENAME read transaction script from FILENAME\n -j, --jobs=NUM number of threads (default: 1)\n -l, --logwrite transaction times to log file\n + -L, --limit=NUM under throttling (--rate), skip transactions that\n + far behind schedule in ms (default: do not skip)\n -M, --protocol=simple|extended|prepared\n protocol for submitting queries (default: simple)\n -n, --no-vacuum do not run VACUUM before 
tests\n @@ -1016,6 +1027,24 @@ top: thread->throttle_trigger += wait; + if (throttle_latency_limit) + { + instr_time now; + int64 now_us; + INSTR_TIME_SET_CURRENT(now); + now_us = INSTR_TIME_GET_MICROSEC(now); + while (thread->throttle_trigger < now_us - throttle_latency_limit) + { +/* if too far behind, this slot is skipped, and we + * iterate till the next nearly on time slot. + */ +int64 wait = (int64) (throttle_delay * + 1.00055271703 * -log(getrand(thread, 1, 10000) / 10000.0)); +thread->throttle_trigger += wait; +thread->throttle_latency_skipped++; + } + } + st->until = thread->throttle_trigger; st->sleeping = 1; st->throttling = true; @@ -2351,7 +2380,8 @@ printResults(int ttype, int64 normal_xacts, int nclients, TState *threads, int nthreads, instr_time total_time, instr_time conn_total_time, int64 total_latencies, int64 total_sqlats, - int64 throttle_lag, int64 throttle_lag_max) + int64 throttle_lag, int64 throttle_lag_max, + int64 throttle_latency_skipped) { double time_include, tps_include, @@ -2417,6 +2447,10 @@ printResults(int ttype, int64 normal_xacts, int nclients, */ printf("rate limit schedule lag: avg %.3f (max %.3f) ms\n", 0.001 * throttle_lag / normal_xacts, 0.001 * throttle_lag_max); + if (throttle_latency_limit) + printf("number of skipped transactions: " INT64_FORMAT " (%.3f %%)\n", + throttle_latency_skipped, + 100.0 * throttle_latency_skipped / (throttle_latency_skipped + normal_xacts)); } printf("tps = %f (including connections establishing)\n", tps_include); @@ -2505,6 +2539,7 @@ main(int argc, char **argv) {"sampling-rate", required_argument, NULL, 4}, {"aggregate-interval", required_argument, NULL, 5}, {"rate", required_argument, NULL, 'R'}, + {"limit", required_argument, NULL, 'L'}, {NULL, 0, NULL, 0} }; @@ -2534,6 +2569,7 @@ main(int argc, char **argv) int64 total_sqlats = 0; int64 throttle_lag = 0; int64 throttle_lag_max = 0; + int64 throttle_latency_skipped = 0; int i; @@ -2775,6 +2811,18 @@ main(int argc, char **argv) throttle_delay = (int64) (1000000.0 /
throttle_value); } break; + case 'L': +{ + double limit_ms = atof(optarg); + if (limit_ms <= 0.0) + { + fprintf(stderr, "invalid latency limit: %s\n", optarg); + exit(1); + } + benchmarking_option_set = true; + throttle_latency_limit = (int64) (limit_ms * 1000); +} +break; case 0: /* This covers long options which take no argument. */ if (foreign_keys || unlogged_tables) @@ -2895,6 +2943,12 @@ main(int argc, char **argv) exit(1); } + if (throttle_latency_limit != 0 && throttle_delay == 0) + { + fprintf(stderr, latency limit (-L)
Re: [HACKERS] Parallel Sequence Scan doubts
On Sun, Aug 24, 2014 at 12:34 PM, Craig Ringer cr...@2ndquadrant.com wrote: On 08/24/2014 09:40 AM, Haribabu Kommi wrote: Any suggestions? Another point I didn't raise the first time around, but that's IMO quite significant, is that you haven't addressed why this approach to fully parallel seqscans is useful and solves real problems in effective ways. It might seem obvious - of course they're useful. But I see two things they'd address: - CPU-limited sequential scans, where expensive predicates are filtering the scan; and Yes, we are mainly targeting CPU-limited sequential scans. For that very reason I want the workers to handle the predicates as well, not just read the tuples from disk. - I/O-limited sequential scans, where the predicates already execute fast enough on one CPU, so most time is spent waiting for more disk I/O. The problem I see with your design is that it's not going to be useful for a large class of CPU-limited scans where the predicate isn't composed entirely of immutable functions and operators. Especially since immutable-only predicates are the best candidates for expression indexes anyway. While it'd likely be useful for I/O-limited scans, it's going to increase contention on shared_buffers locking and page management. More importantly, is it the most efficient way to solve the problem with I/O-limited scans? I would seriously suggest looking at first adding support for asynchronous I/O across ranges of extents during a sequential scan. You might not need multiple worker backends at all. I'm sure using async I/O to implement effective_io_concurrency for seqscans has been discussed and explored before, so again I think some time in the list archives might make sense. Thanks for your inputs. I will check it.
I don't know if it makes sense to do something as complex as parallel multi-process seqscans without having a path forward for supporting non-immutable functions - probably with fmgr API enhancements, additional function labels (PARALLEL), etc., depending on what you find is needed. Thanks for your inputs. I will check for a solution to extend the support to non-immutable functions as well. 3. In the executor Init phase, try to copy the necessary data required by the workers and start the workers. Copy how? Back-ends can only communicate with each other over shared memory, signals, and using sockets. Sorry for not being clear; I mean copying those data structures into dynamic shared memory only. From there the workers can access them. That'll probably work with read-only data, but it's not viable for read/write data unless you use a big lock to protect it, in which case you lose the parallelism you want to achieve. As of now I am thinking of sharing only read-only data with the workers. In case read/write data needs to be shared, then we may need another approach to handle that. You'd have to classify what may be modified during scan execution carefully and determine if you need to feed any of the resulting modifications back to the original backend - and how to merge modifications by multiple workers, if it's even possible. That's going to involve a detailed structure-by-structure analysis and seems likely to be error prone and buggy. Thanks for your inputs. I will check it properly. I think you should probably talk to Robert Haas about what he's been doing over the last couple of years on parallel query. Sure, I will check with him. 4. In the executor run phase, just get the tuples which are sent by the workers and process them further in the plan node execution. Again, how do you propose to copy these back to the main bgworker? With the help of message queues that are created in the dynamic shared memory, the workers can send the data to the queue.
On the other side the main backend receives the tuples from the queue. OK, so you plan to implement shmem queues. That'd be a useful starting point, as it'd be something that would be useful in its own right. shmem queues are already possible with dynamic shared memory; I just want to use them here. You'd have to be able to handle individual values that're larger than the ring buffer or whatever you're using for transfers, in case you're dealing with already-detoasted tuples or in-memory tuples. Again, chatting with Robert and others who've worked on dynamic shmem, parallel query, etc. would be wise here. Yes, you are correct. For that reason I am thinking of supporting only functions that depend solely on input variables and do not modify any global data. You'll want to be careful with that. Nothing stops an immutable function referencing a cache in a C global that it initializes once and then treats as read-only, for example. I suspect you'll need a per-function whitelist. I'd love to be wrong. Yes, we need per-function level details. Once we have a better solution to handle non-immutable functions as well, these may not be required.
Re: [HACKERS] implement subject alternative names support for SSL connections
On Wed, Aug 20, 2014 at 11:53 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 07/25/2014 07:10 PM, Alexey Klyukin wrote: Greetings, I'd like to propose a patch for checking subject alternative names entry in the SSL certificate for DNS names during SSL authentication. Thanks! I just ran into this missing feature last week, while working on my SSL test suite. So +1 for having the feature. This patch needs to be rebased over current master branch, thanks to my refactoring that moved all OpenSSL-specific stuff to be-secure-openssl.c. The patch is rebased against fe-secure-openssl.c (that's where verify_peer_name_matches_certificate appeared in the master branch), I've changed the condition in the for loop to be less confusing (thanks to comments from Magnus and Tom), making an explicit break once a match is detected. Note that It generates a lot of OpenSSL related warnings on my system (66 total) with clang, complaining about $X is deprecated: first deprecated in OS X 10.7 [-Wdeprecated-declarations], but it does so for most other SSL functions, so I don't think it's a problem introduced by this patch. Sincerely, Alexey. diff --git a/src/interfaces/libpq/fe-secure-openssl.c b/src/interfaces/libpq/fe-secure-openssl.c new file mode 100644 index f950fc3..b4f6bc9 *** a/src/interfaces/libpq/fe-secure-openssl.c --- b/src/interfaces/libpq/fe-secure-openssl.c *** *** 60,65 --- 60,66 #ifdef USE_SSL_ENGINE #include openssl/engine.h #endif + #include openssl/x509v3.h static bool verify_peer_name_matches_certificate(PGconn *); static intverify_cb(int ok, X509_STORE_CTX *ctx); *** wildcard_certificate_match(const char *p *** 473,479 /* ! *Verify that common name resolves to peer. */ static bool verify_peer_name_matches_certificate(PGconn *conn) --- 474,480 /* ! *Verify that common name or any of the alternative dNSNames resolves to peer. 
*/ static bool verify_peer_name_matches_certificate(PGconn *conn) *** verify_peer_name_matches_certificate(PGc *** 492,499 /* * Extract the common name from the certificate. -* -* XXX: Should support alternate names here */ /* First find out the name's length and allocate a buffer for it. */ len = X509_NAME_get_text_by_NID(X509_get_subject_name(conn-peer), --- 493,498 *** verify_peer_name_matches_certificate(PGc *** 556,565 result = true; else { printfPQExpBuffer(conn-errorMessage, ! libpq_gettext(server common name \%s\ does not match host name \%s\\n), peer_cn, conn-pghost); - result = false; } } --- 555,627 result = true; else { + inti; + intalt_names_total; + STACK_OF(GENERAL_NAME) *alt_names; + + /* Get the list and the total number of alternative names. */ + alt_names = (STACK_OF(GENERAL_NAME) *) X509_get_ext_d2i(conn-peer, NID_subject_alt_name, NULL, NULL); + if (alt_names != NULL) + alt_names_total = sk_GENERAL_NAME_num(alt_names); + else + alt_names_total = 0; + + result = false; + + /* +* Compare the alternative names dnsNames identifies against +* the originally given hostname. +*/ + for (i = 0; i alt_names_total; i++) + { + const GENERAL_NAME *name = sk_GENERAL_NAME_value(alt_names, i); + if (name-type == GEN_DNS) + { + intreported_len; + intactual_len; + char *dns_namedata, + *dns_name; + + reported_len = ASN1_STRING_length(name-d.dNSName); + /* GEN_DNS can be only IA5String, equivalent to US ASCII */ + dns_namedata = (char *) ASN1_STRING_data(name-d.dNSName); + + dns_name = malloc(reported_len + 1); + memcpy(dns_name, dns_namedata, reported_len); + dns_name[reported_len] = '\0'; + +
Re: [HACKERS] failures on barnacle (CLOBBER_CACHE_RECURSIVELY) because of memory leaks
Tomas Vondra t...@fuzzy.cz writes: I stopped the already running test on addax and started the test on barnacle again. Let's see in a few days/weeks/months what is the result. It seems to be running much faster (probably after removing the randomization), and apparently it passed the create_view tests without crashing, but then crashed at 'constraints' (again, because of OOM). I poked into the constraints test and soon found another leak just like the other one :-(, which I've now fixed. I lack the patience to let constraints run to conclusion with C_C_R on --- I let it run for about 20 hours and it still was only maybe 1/4th done --- but it got further than it did in your report and there was zero sign of leakage. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] failures on barnacle (CLOBBER_CACHE_RECURSIVELY) because of memory leaks
On 24.8.2014 18:01, Tom Lane wrote: Tomas Vondra t...@fuzzy.cz writes: I stopped the already running test on addax and started the test on barnacle again. Let's see in a few days/weeks/months what is the result. It seems to be running much faster (probably after removing the randomization), and apparently it passed the create_view tests without crashing, but then crashed at 'constraints' (again, because of OOM). I poked into the constraints test and soon found another leak just like the other one :-(, which I've now fixed. I lack the patience to let constraints run to conclusion with C_C_R on --- I let it run for about 20 hours and it still was only maybe 1/4th done --- but it got further than it did in your report and there was zero sign of leakage. OK. There was some outage in the facility hosting this machine, and it went up just a few hours ago. So I killed the current run, and it'll start chewing on this new commit in a few minutes. Regarding those leaks we've detected so far - is it the kind of leaks that can happen only in testing with those specific flags, or is it something that can happen in production too? (Assuming no one is running with CLOBBER_CACHE_RECURSIVELY in production, of course ;-) That is, does it seem worth the effort running those tests / fixing those leaks? regards Tomas
Re: [HACKERS] failures on barnacle (CLOBBER_CACHE_RECURSIVELY) because of memory leaks
Tomas Vondra t...@fuzzy.cz writes: Regarding those leaks we've detected so far - is it the kind of leaks that can happen only in testing with those specific flags, or is it something that can happen in production too? (Assuming no one is running with CLOBBER_CACHE_RECURSIVELY in production, of course ;-) That is, does it seem worth the effort running those tests / fixing those leaks? I believe most or all of these leaks were understood and intentionally ignored in the original coding, on the grounds that they were intraquery leaks and no real-world situation would ever cause so many cache reloads in a single query that the leakage would amount to anything problematic. I think that reasoning is still valid for production usage. It seems worth fixing the leaks in HEAD so that we can get through the regression tests on barnacle and see if anything of greater significance turns up --- but if this is all we find, it might not have been worth the trouble. regards, tom lane
Re: [HACKERS] failures on barnacle (CLOBBER_CACHE_RECURSIVELY) because of memory leaks
On 24.8.2014 18:28, Tom Lane wrote: Tomas Vondra t...@fuzzy.cz writes: Regarding those leaks we've detected so far - is it the kind of leaks that can happen only in testing with those specific flags, or is it something that can happen in production too? (Assuming no one is running with CLOBBER_CACHE_RECURSIVELY in production, of course ;-) That is, does it seem worth the effort running those tests / fixing those leaks? I believe most or all of these leaks were understood and intentionally ignored in the original coding, on the grounds that they were intraquery leaks and no real-world situation would ever cause so many cache reloads in a single query that the leakage would amount to anything problematic. I think that reasoning is still valid for production usage. It seems worth fixing the leaks in HEAD so that we can get through the regression tests on barnacle and see if anything of greater significance turns up --- but if this is all we find, it might not have been worth the trouble. OK. Some time ago we got a report (on the czech mailing list) with this: TopMemoryContext: 1375320 total in 168 blocks; 6472 free (18 chunks); 1368848 used ... CacheMemoryContext: 232883248 total in 5251 blocks; 5644000 free (2 chunks); 227239248 used Apparently they're using some sort of persistent connections, and there's ~8000 tables in that particular database, which eventually leads to OOM for them. Could this be related? Anyway, let's leave the tests running - either we find something interesting or not. I think it's worth it. regards Tomas
Re: [HACKERS] jsonb format is pessimal for toast compression
On Thu, Aug 21, 2014 at 6:20 PM, Josh Berkus j...@agliodbs.com wrote: On 08/20/2014 03:42 PM, Arthur Silva wrote: What data are you using right now Josh? The same data as upthread. Can you test the three patches (9.4 head, 9.4 with Tom's cleanup of Heikki's patch, and 9.4 with Tom's latest lengths-only) on your workload? I'm concerned that my workload is unusual and don't want us to make this decision based entirely on it. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com Here's my test results so far with the github archive data. It's important to keep in mind that the PushEvent event objects that I use in the queries only contain a small number of keys (8 to be precise), so these tests don't really stress the changed code. Anyway, in this dataset (with the small objects) using the all-lengths patch provides small compression savings and the overhead is minimal.

Test data: 610MB of Json -- 341969 items
Index size (jsonb_ops): 331MB

Test query 1: SELECT data->'url', data->'actor' FROM t_json WHERE data @> '{"type": "PushEvent"}'
Test query 1 items: 169732
Test query 2: SELECT data FROM t_json WHERE data @> '{"type": "PushEvent"}'
Test query 2 items:

HEAD (aka, all offsets) EXTENDED
Size: 374MB
Toast Size: 145MB
Test query 1 runtime: 680ms
Test query 2 runtime: 405ms

HEAD (aka, all offsets) EXTERNAL
Size: 366MB
Toast Size: 333MB
Test query 1 runtime: 505ms
Test query 2 runtime: 350ms

All Lengths (Tom Lane patch) EXTENDED
Size: 379MB
Toast Size: 108MB
Test query 1 runtime: 720ms
Test query 2 runtime: 420ms

All Lengths (Tom Lane patch) EXTERNAL
Size: 366MB
Toast Size: 333MB
Test query 1 runtime: 525ms
Test query 2 runtime: 355ms

-- Arthur Silva
Re: [HACKERS] SKIP LOCKED DATA (work in progress)
On 22 August 2014 23:02, Alvaro Herrera alvhe...@2ndquadrant.com wrote: heap_lock_tuple() has the following comment on top: * In the failure cases, the routine fills *hufd with the tuple's t_ctid, * t_xmax (resolving a possible MultiXact, if necessary), and t_cmax * (the last only for HeapTupleSelfUpdated, since we * cannot obtain cmax from a combocid generated by another transaction). * See comments for struct HeapUpdateFailureData for additional info. With the patch as submitted, this struct is not filled in the HeapTupleWouldBlock case. I'm not sure this is okay, though I admit the only caller that passes LockWaitSkip doesn't care, so maybe we could just ignore the issue (after properly modifying the comment). It seems easy to add a LockBuffer() and goto failed rather than a return; that would make that failure case conform to HeapTupleUpdated and HeapTupleSelfUpdated. OTOH it might be pointless to worry about what would be essentially dead code currently ... Thanks Alvaro. Forgive me if I have misunderstood but it looks like your incremental patch included a couple of unrelated changes, namely s/0/InvalidCommandId/ and a reversion of ConditionalMultiXactIdWait. Undoing those gave me skip-locked-v12-b.patch, attached for reference, and I've included those changes in a new full patch skip-locked-v13.patch (also rebased). +1 for the change from if-then-else to switch statements. I was less sure about the 'goto failed' change, but I couldn't measure any change in concurrent throughput in my simple benchmark, so I guess that extra buffer lock/unlock doesn't matter and I can see why you prefer that control flow. Did you consider heap_lock_updated_tuple? 
A rationale for saying it doesn't need to pay attention to the wait policy is: if you're trying to lock-skip-locked an updated tuple, then you either skip it because its updater is running, or you return it because it's no longer running; and if you return it, it is not possible for the updater to be locking the updated version. However, what if there's a third transaction that locked the updated version? It might be difficult to hit this case or construct an isolationtester spec file though, since there's a narrow window you need to race to. Hmm. I will look into this, thanks. Best regards, Thomas Munro diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index 31073bc..464a73c 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -4222,22 +4222,27 @@ l3: */ if (!have_tuple_lock) { - if (wait_policy == LockWaitBlock) + switch (wait_policy) { -LockTupleTuplock(relation, tid, mode); - } - else if (wait_policy == LockWaitError) - { -if (!ConditionalLockTupleTuplock(relation, tid, mode)) - ereport(ERROR, - (errcode(ERRCODE_LOCK_NOT_AVAILABLE), - errmsg(could not obtain lock on row in relation \%s\, - RelationGetRelationName(relation; - } - else /* wait_policy == LockWaitSkip */ - { -if (!ConditionalLockTupleTuplock(relation, tid, mode)) - return HeapTupleWouldBlock; +case LockWaitBlock: + LockTupleTuplock(relation, tid, mode); + break; +case LockWaitSkip: + if (!ConditionalLockTupleTuplock(relation, tid, mode)) + { + result = HeapTupleWouldBlock; + /* recovery code expects to have buffer lock held */ + LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE); + goto failed; + } + break; +case LockWaitError: + if (!ConditionalLockTupleTuplock(relation, tid, mode)) + ereport(ERROR, +(errcode(ERRCODE_LOCK_NOT_AVAILABLE), + errmsg(could not obtain lock on row in relation \%s\, + RelationGetRelationName(relation; + break; } have_tuple_lock = true; } @@ -4441,34 +4446,36 @@ l3: if (status = MultiXactStatusNoKeyUpdate) 
elog(ERROR, invalid lock mode in heap_lock_tuple); -/* wait for multixact to end */ -if (wait_policy == LockWaitBlock) +/* wait for multixact to end, or die trying */ +switch (wait_policy) { - MultiXactIdWait((MultiXactId) xwait, status, infomask, - relation, tuple-t_data-t_ctid, - XLTW_Lock, NULL); -} -else if (wait_policy == LockWaitError) -{ - if (!ConditionalMultiXactIdWait((MultiXactId) xwait, - status, infomask, relation, - tuple-t_data-t_ctid, - XLTW_Lock, NULL)) - ereport(ERROR, -(errcode(ERRCODE_LOCK_NOT_AVAILABLE), - errmsg(could not obtain lock on row in relation \%s\, - RelationGetRelationName(relation; -} -else /* wait_policy == LockWaitSkip */ -{ - if (!ConditionalMultiXactIdWait((MultiXactId) xwait, - status, infomask, relation, - tuple-t_data-t_ctid, - XLTW_Lock, NULL)) - { - UnlockTupleTuplock(relation, tid, mode); - return HeapTupleWouldBlock; - } + case
Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch
Thanks for the replies and thoughts. On 08/19/14 18:27, Heikki Linnakangas wrote: On 08/20/2014 12:17 AM, John Lumby wrote: I am attaching a new version of the patch for consideration in the current commit fest. Thanks for working on this! Relative to the one I submitted on 25 June in bay175-w412ff89303686022a9f16aa3...@phx.gbl the method for handling aio completion using sigevent has been re-written to use signals exclusively rather than a composite of signals and LWlocks, and this has fixed the problem I mentioned before with the LWlock method. ISTM the patch is still allocating stuff in shared memory that really doesn't belong there. Namely, the BufferAiocb structs. Or at least parts of it; there's also a waiter queue there which probably needs to live in shared memory, but the rest of it does not. Actually the reason the BufferAiocb ( the postgresql block corresponding to the aio's aiocb) must be located in shared memory is that, as you said, it acts as anchor for the waiter list. See further comment below. At least BufAWaitAioCompletion is still calling aio_error() on an AIO request that might've been initiated by another backend. That's not OK. Yes, you are right, and I agree with this one - I will add a aio_error_return_code field in the BufferAiocb and only the originator will set this from the real aiocb. Please write the patch without atomic CAS operation. Just use a spinlock. Umm, this is a new criticism I think. I use CAS for things other than locking, such as add/remove from shared queue. I suppose maybe a spinlock on the entire queue can be used equivalently, but with more code (extra confusion) and worse performance (coarser serialization). What is your objection to using gcc's atomic ops? Portability? There's a patch in the commitfest to add support for that, sorry, support for what? There are already spinlocks in postgresql, you mean some new kind?please point me at it with hacker msgid or something. 
but it's not committed yet, and all those USE_AIO_ATOMIC_BUILTIN_COMP_SWAP ifdefs make the patch more difficult to read. Same with all the other #ifdefs; please just pick a method that works. Ok, yes, the ifdefs are unpleasant.I will do something about that. Ideally they would be entirely confined to header files and only macro functions in C files - maybe I can do that. And eventually when the dust has settled eliminate obsolete ifdef blocks altogether. Also, please split prefetching of regular index scans into a separate patch. It's orthogonal to doing async I/O; actually not completely orthogonal, see next we could prefetch regular index scans with posix_fadvise too, and AIO should be useful for prefetching other stuff. On 08/19/14 19:10, Claudio Freire wrote: On Tue, Aug 19, 2014 at 7:27 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: Also, please split prefetching of regular index scans into a separate patch. ... That patch already happened on the list, and it wasn't a win in many cases. I'm not sure it should be proposed independently of this one. Maybe a separate patch, but it should be considered dependent on this. I don't have an archive link at hand atm, but I could produce one later. Several people have asked to split this patch into several smaller ones and I have thought about it. It would introduce some awkward dependencies. E.g. splitting the scanner code (index, relation-heap) into separate patch from aio code would mean that the scanner patch becomes dependent on the aio patch. They are not quite orthogonal. The reason is that the scanners call a new function, DiscardBuffer(blockid) when aio is in use. We can get around it by providing a stub for that function in the scanner patch, but then there is some risk of someone getting the wrong version of that function in their build. It just adds yet more complexity and breakage opportunities. 
- Heikki One further comment concerning these BufferAiocb and aiocb control blocks being in shared memory : I explained above that the BufferAiocb must be in shared memory for wait/post. The aiocb does not necessarily have to be in shared memory, but since there is a one-to-one relationship between BufferAiocb and aiocb, it makes the code much simpler , and the operation much more efficient, if the aiocb is embedded in the BufferAiocb as I have it now. E.g. if the aiocb is in private-process memory, then an additional allocation scheme is needed (fixed number? palloc()in'g extra ones as needed? ...) which adds complexity, and probably some inefficiency since a shared pool is usually more efficient (allows higher maximum per process etc), and there would have to be some pointer de-referencing from BufferAiocb to aiocb adding some (small) overhead. I understood your objection to use of shared memory as being that you don't want a non-originator to access the originator's aiocb using
[HACKERS] Code bug or doc bug?
Folks, Quoth our docs (http://www.postgresql.org/docs/9.3/static/sql-alterdatabase.html): The fourth form changes the default tablespace of the database. Only the database owner or a superuser can do this; you must also have create privilege for the new tablespace. This command physically moves any tables or indexes in the database's old default tablespace to the new tablespace. Note that tables and indexes in non-default tablespaces are not affected. Yet: jberkus=# alter database phc set tablespace ssd; ERROR: some relations of database "phc" are already in tablespace "ssd" HINT: You must move them back to the database's default tablespace before using this command. Aside from being a stupid limitation (I need to copy the tables back to the old tablespace so that I can recopy them to the new one?), the above seems to be in direct contradiction to the docs. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] SKIP LOCKED DATA (work in progress)
On 24 August 2014 22:04, Thomas Munro mu...@ip9.org wrote: On 22 August 2014 23:02, Alvaro Herrera alvhe...@2ndquadrant.com wrote: Did you consider heap_lock_updated_tuple? A rationale for saying it doesn't need to pay attention to the wait policy is: if you're trying to lock-skip-locked an updated tuple, then you either skip it because its updater is running, or you return it because it's no longer running; and if you return it, it is not possible for the updater to be locking the updated version. However, what if there's a third transaction that locked the updated version? It might be difficult to hit this case or construct an isolationtester spec file though, since there's a narrow window you need to race to. Hmm. I will look into this, thanks. While trying to produce the heap_lock_updated_tuple_rec case you describe (so far unsuccessfully), I discovered I could make SELECT ... FOR UPDATE NOWAIT block indefinitely on unpatched 9.3 in a different code path after heap_lock_tuple returns: in another session, UPDATE, COMMIT, then UPDATE, all after the first session has taken its snapshot but before it tries to lock a given row. The code in EvalPlanQualFetch (reached from the HeapTupleUpdated case in ExecLockRow) finishes up waiting for the uncommitted transaction. I think I see how to teach EvalPlanQualFetch how to handle wait policies: for NOWAIT it should ereport (fixing a pre-existing bug (?)), and I guess it should handle SKIP LOCKED by returning NULL, similarly to the way it handles deleted rows, and of course in all cases passing the wait policy forward to heap_lock_tuple, which it eventually calls. Still looking at heap_lock_updated_tuple. The difficulty of course will be testing all these racy cases reproducibly... Best regards, Thomas Munro -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
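The blocking scenario described above can be sketched as a two-session timeline (table `t` and its rows are hypothetical):

```sql
-- Session 1: start a READ COMMITTED statement whose snapshot is taken
-- before the updates below.  The window is narrow; an advisory-lock
-- gate, as suggested downthread, can make it deterministic.
SELECT * FROM t FOR UPDATE NOWAIT;

-- Session 2, inside that window:
BEGIN; UPDATE t SET v = v + 1 WHERE id = 1; COMMIT;
BEGIN; UPDATE t SET v = v + 1 WHERE id = 1;   -- left uncommitted

-- Result on unpatched 9.3: session 1 reaches EvalPlanQualFetch via the
-- HeapTupleUpdated case in ExecLockRows and waits for session 2's
-- uncommitted transaction instead of erroring out immediately.
```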
Re: [HACKERS] SKIP LOCKED DATA (work in progress)
On 08/25/2014 09:44 AM, Thomas Munro wrote: On 24 August 2014 22:04, Thomas Munro mu...@ip9.org wrote: On 22 August 2014 23:02, Alvaro Herrera alvhe...@2ndquadrant.com wrote: Did you consider heap_lock_updated_tuple? A rationale for saying it doesn't need to pay attention to the wait policy is: if you're trying to lock-skip-locked an updated tuple, then you either skip it because its updater is running, or you return it because it's no longer running; and if you return it, it is not possible for the updater to be locking the updated version. However, what if there's a third transaction that locked the updated version? It might be difficult to hit this case or construct an isolationtester spec file though, since there's a narrow window you need to race to. Hmm. I will look into this, thanks. While trying to produce the heap_lock_updated_tuple_rec case you describe (so far unsuccessfully), I discovered I could make SELECT ... FOR UPDATE NOWAIT block indefinitely on unpatched 9.3 in a different code path after heap_lock_tuple returns: in another session, UPDATE, COMMIT, then UPDATE, all after the first session has taken its snapshot but before it tries to lock a given row. The code in EvalPlanQualFetch (reached from the HeapTupleUpdated case in ExecLockRow) finishes up waiting for the uncommitted transaction. I think that's the issue Andres and I patched for 9.3, but I don't know if it got committed. I'll need to have a look. A search in the archives for heap_lock_tuple and nowait might be informative. The difficulty of course will be testing all these racy cases reproducibly... Yep, especially as isolationtester can only really work at the statement level, and can only handle one blocked connection at a time. It's possible a helper extension could be used - set up some locks in shmem, register two sessions for different test roles within a given test to install appropriate hooks, wait until they're both blocked on the locks the hooks acquire, then release the locks. 
That might land up with hook points scattered everywhere, but they could be some pretty minimal macros at least. IMO that's a separate project, though. For this kind of testing I've tended to just set conditional breakpoints in gdb, wait until they trap, then once everything's lined up release the breakpoints in both sessions as simultaneously as possible. Not perfect, but has tended to work. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] SKIP LOCKED DATA (work in progress)
Thomas Munro wrote: While trying to produce the heap_lock_updated_tuple_rec case you describe (so far unsuccessfully), I discovered I could make SELECT ... FOR UPDATE NOWAIT block indefinitely on unpatched 9.3 in a different code path after heap_lock_tuple returns: in another session, UPDATE, COMMIT, then UPDATE, all after the first session has taken its snapshot but before it tries to lock a given row. The code in EvalPlanQualFetch (reached from the HeapTupleUpdated case in ExecLockRow) finishes up waiting for the uncommitted transaction. Interesting -- perhaps this helps explain the deadlock issue reported in http://www.postgresql.org/message-id/cakrjmhdn+ghajnwqfhsotgp+7yn27zr79m99rcajmnazt5n...@mail.gmail.com I think I see how to teach EvalPlanQualFetch how to handle wait policies: for NOWAIT it should ereport (fixing a pre-existing bug (?)), and I guess it should handle SKIP LOCKED by returning NULL, similarly to the way it handles deleted rows, and of course in all cases passing the wait policy forward to heap_lock_tuple, which it eventually calls. Still looking at heap_lock_updated_tuple. The difficulty of course will be testing all these racy cases reproducibly... Does this help? http://www.postgresql.org/message-id/51fb4305.3070...@2ndquadrant.com The useful trick there is forcing a query to get its snapshot and then go to sleep before actually doing anything, by way of an advisory lock. -- Álvaro Herrera    http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
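As a sketch of the advisory-lock idea (key 42 and table `t` are placeholders; the linked message works the gate into a single query, which is what the READ COMMITTED EvalPlanQualFetch path discussed above requires):

```sql
-- Session 1: hold the gate before the test begins.
SELECT pg_advisory_lock(42);

-- Session 2: under REPEATABLE READ the snapshot from the first query
-- persists, so the lock wait cleanly separates snapshot acquisition
-- from the attempt to lock the row.
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT 1;                        -- acquires the transaction snapshot
SELECT pg_advisory_lock(42);     -- sleeps here until session 1 releases

-- Session 1: now create whatever racy state is under test, then:
SELECT pg_advisory_unlock(42);

-- Session 2: proceeds with a snapshot that predates session 1's changes.
SELECT * FROM t WHERE id = 1 FOR UPDATE NOWAIT;
```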
Re: [HACKERS] ALTER SYSTEM RESET?
On Wed, Jul 30, 2014 at 9:11 AM, Amit Kapila amit.kapil...@gmail.com wrote: I have verified the patch and found that it works well for all scenarios. Few minor suggestions: 1. ! values to the <filename>postgresql.auto.conf</filename> file. ! Setting the parameter to <literal>DEFAULT</literal>, or using the ! <command>RESET</command> variant, removes the configuration entry from It would be better if we could write a separate line for RESET ALL, as is done for both ALTER DATABASE and ALTER ROLE in their respective documentation. 2. ! %type <vsetstmt> generic_set set_rest set_rest_more generic_reset reset_rest SetResetClause FunctionSetResetClause It would be good to break this into two lines. 3. I think we can add some text on top of the function AlterSystemSetConfigFile() to explain its behaviour w.r.t. RESET ALL. I have updated the patch to address the above points. I will mark this patch as Ready For Committer, as most of the review comments were already addressed by Vik and I have addressed the remaining ones in the attached patch. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com alter_system_reset.v4.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Maximum number of WAL files in the pg_xlog directory
On 8 August 2014 09:08, Guillaume Lelarge guilla...@lelarge.info wrote: Hi, As part of our monitoring work for our customers, we stumbled upon an issue with our customers' servers that have a wal_keep_segments setting higher than 0. We have a monitoring script that checks the number of WAL files in the pg_xlog directory, according to the setting of three parameters (checkpoint_completion_target, checkpoint_segments, and wal_keep_segments). We usually add a percentage to the usual formula: greatest( (2 + checkpoint_completion_target) * checkpoint_segments + 1, checkpoint_segments + wal_keep_segments + 1 ) And we have lots of alerts from the script for customers who set their wal_keep_segments setting higher than 0. So we started to question this sentence of the documentation: There will always be at least one WAL segment file, and will normally not be more than (2 + checkpoint_completion_target) * checkpoint_segments + 1 or checkpoint_segments + wal_keep_segments + 1 files. (http://www.postgresql.org/docs/9.3/static/wal-configuration.html) While doing some tests, it appears it would be more like: wal_keep_segments + (2 + checkpoint_completion_target) * checkpoint_segments + 1 But after reading the source code (src/backend/access/transam/xlog.c), the right formula seems to be: wal_keep_segments + 2 * checkpoint_segments + 1 Here is how we got to this formula... CreateCheckPoint(...) is responsible, among other things, for deleting and recycling old WAL files. From src/backend/access/transam/xlog.c, master branch, line 8363: /* * Delete old log files (those no longer needed even for previous * checkpoint or the standbys in XLOG streaming). */ if (_logSegNo) { KeepLogSeg(recptr, _logSegNo); _logSegNo--; RemoveOldXlogFiles(_logSegNo, recptr); } The KeepLogSeg(...) function takes care of wal_keep_segments.
From src/backend/access/transam/xlog.c, master branch, line 8792: /* compute limit for wal_keep_segments first */ if (wal_keep_segments > 0) { /* avoid underflow, don't go below 1 */ if (segno <= wal_keep_segments) segno = 1; else segno = segno - wal_keep_segments; } IOW, the segment number (segno) is decremented according to the setting of wal_keep_segments. segno is then sent back to CreateCheckPoint(...) via _logSegNo. RemoveOldXlogFiles() gets this segment number so that it can remove or recycle all files before it. This function gets the number of WAL files to recycle from the XLOGfileslop constant, which is defined as: /* * XLOGfileslop is the maximum number of preallocated future XLOG segments. * When we are done with an old XLOG segment file, we will recycle it as a * future XLOG segment as long as there aren't already XLOGfileslop future * segments; else we'll delete it. This could be made a separate GUC * variable, but at present I think it's sufficient to hardwire it as * 2*CheckPointSegments+1. Under normal conditions, a checkpoint will free * no more than 2*CheckPointSegments log segments, and we want to recycle all * of them; the +1 allows boundary cases to happen without wasting a * delete/create-segment cycle. */ #define XLOGfileslop (2*CheckPointSegments + 1) (in src/backend/access/transam/xlog.c, master branch, line 100) IOW, PostgreSQL will keep wal_keep_segments WAL files before the current WAL file, and then there may be 2*CheckPointSegments + 1 recycled ones. Hence the formula: wal_keep_segments + 2 * checkpoint_segments + 1 And this is what we usually find on our customers' servers. We may find more WAL files, depending on the write activity of the cluster, but on average we get this number of WAL files. AFAICT, the documentation is wrong about the usual number of WAL files in the pg_xlog directory.
But I may be wrong, in which case the documentation isn't clear enough for me, and should be fixed so that others can't misinterpret it like I may have done. Any comments? Did I miss something, or should we fix the documentation? Thanks. Ping?
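The two formulas in the thread can be sanity-checked numerically (function names and the sample settings below are mine, not from the sources):

```c
/* Documented bound: greatest((2 + cct)*cs + 1, cs + wks + 1) */
int doc_formula(double cct, int cs, int wks)
{
    int a = (int) ((2 + cct) * cs) + 1;
    int b = cs + wks + 1;
    return a > b ? a : b;
}

/* Bound derived from xlog.c above: wal_keep_segments + XLOGfileslop,
 * where XLOGfileslop is 2*CheckPointSegments + 1 */
int source_formula(int cs, int wks)
{
    return wks + 2 * cs + 1;
}
```

With checkpoint_completion_target = 0.5, checkpoint_segments = 3, and wal_keep_segments = 32, the documented bound gives 36 files while the source-derived one gives 39, which matches the kind of discrepancy the monitoring script was flagging.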
Re: [HACKERS] WIP Patch for GROUPING SETS phase 1
Robert == Robert Haas robertmh...@gmail.com writes: Robert I can accept ugly code, but I feel strongly that we shouldn't Robert accept ugly semantics. Forcing cube to get out of the way Robert may not be pretty, but I think it will be much worse if we Robert violate the rule that quoting a keyword strips it of its Robert special meaning; or the rule that there are four kinds of Robert keywords and, if a keyword of a particular class is accepted Robert as an identifier in a given context, all other keywords in Robert that class will also be accepted as identifiers in that Robert context. Violating those rules could have not-fun-at-all Robert consequences like needing to pass additional context Robert information to ruleutils.c's quote_identifier() function, or Robert not being able to dump and restore a query from an older Robert version even with --quote-all-identifiers. Renaming the cube Robert type will suck for people who are using it; but it will only Robert have to be done once; weird stuff like the above will be with Robert us forever. If you look at the latest patch post, there's a small patch in it that does nothing but unreserve the keywords and fix ruleutils to make deparse/parse work. The required fix to ruleutils is an example of violating your four kinds of keywords principle, but quoting keywords still works. -- Andrew (irc:RhodiumToad) -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers