Re: [HACKERS] TODO : Allow parallel cores to be used by vacuumdb [ WIP ]

2014-08-24 Thread Amit Kapila
On Tue, Aug 19, 2014 at 4:27 PM, Amit Kapila amit.kapil...@gmail.com
wrote:

 Few more comments:

Some more comments:

1. I see one shortcoming in the way the patch currently parallelizes the
work for --analyze-in-stages. The patch performs the work for each stage
   for multiple tables in concurrent connections. That seems okay when the
number of parallel connections is less than or equal to the number of tables,
   but when the user asks for more connections than there are tables,
this strategy cannot make use of the extra
connections.

2. Similarly, for the case of multiple databases, it cannot use more
   connections than there are tables in each database, because the
   parallelizing strategy only uses concurrent connections for the
   tables inside a single database.

I am not completely sure whether the current strategy is good enough or
whether we should try to address the above problems.  What do you think?

3.
+ do
+ {
+ i = select_loop(maxFd, &slotset);
+ Assert(i != 0);

Could you explain the reason for using this loop? I think you
want to wait for data on a socket descriptor, but why for maxFd?
Also, it would be better if you explained this logic in a comment.
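
For reference, a select()-based wait along these lines (a rough sketch with
hypothetical names, not necessarily how the patch implements select_loop)
shows why maxFd matters: select() must be passed the highest descriptor
value plus one.

#include <sys/select.h>
#include <errno.h>

/*
 * Wait until at least one descriptor in *workerset is readable.
 * maxFd must be the largest descriptor contained in the set.
 */
static int
select_loop(int maxFd, fd_set *workerset)
{
	fd_set		saveSet = *workerset;

	for (;;)
	{
		int			i;

		*workerset = saveSet;
		i = select(maxFd + 1, workerset, NULL, NULL, NULL);
		if (i < 0 && errno == EINTR)
			continue;			/* interrupted by a signal: retry */
		return i;				/* > 0: sockets ready; < 0: error */
	}
}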

4.
+ for (i = 0; i < max_slot; i++)
+ {
+ if (!FD_ISSET(pSlot[i].sock, &slotset))
+ continue;
+
+ PQconsumeInput(pSlot[i].connection);
+ if (PQisBusy(pSlot[i].connection))
+ continue;

I think it would be better to call PQconsumeInput() only if you find
that the connection is busy.
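
In other words, something along these lines (just a sketch against the loop
above, not tested):

if (PQisBusy(pSlot[i].connection))
{
	if (!PQconsumeInput(pSlot[i].connection))
		;	/* error: see PQerrorMessage(pSlot[i].connection) */
	if (PQisBusy(pSlot[i].connection))
		continue;	/* result not complete yet; poll this slot later */
}
/* a complete result can now be fetched without blocking */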


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] pgbench throttling latency limit

2014-08-24 Thread Fabien COELHO



Add --limit to limit latency under throttling

Under throttling, transactions are scheduled for execution at certain times. 
Transactions may fall far behind schedule, and the system may catch up with the 
load later. This option changes that behavior by skipping transactions which 
are too far behind schedule and counting them as skipped.


The idea is to help simulate a latency-constrained environment such as a 
database used by a web server.
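
As an illustration (numbers of my own, not from the patch): at --rate=100 a
transaction is scheduled on average every 10 ms; with --limit=50 added, any
slot whose scheduled start time already lies more than 50 ms in the past is
not executed at all but counted as skipped, and scheduling moves on to the
next slot that is roughly on time.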


Find attached a new version:
 - fix dropped percent computation in the final report
 - simplify progress report code

--
Fabien.

diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 2f7d80e..37a4a8f 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -141,6 +141,13 @@ double		sample_rate = 0.0;
 int64		throttle_delay = 0;
 
 /*
+ * When under throttling, execution time slots which are more than
+ * this late (in us) are skipped, and the corresponding transaction
+ * will be counted as somehow aborted.
+ */
+int64		throttle_latency_limit = 0;
+
+/*
  * tablespace selection
  */
 char	   *tablespace = NULL;
@@ -238,6 +245,7 @@ typedef struct
 	int64		throttle_trigger;		/* previous/next throttling (us) */
 	int64		throttle_lag;	/* total transaction lag behind throttling */
 	int64		throttle_lag_max;		/* max transaction lag */
+	int64		throttle_latency_skipped; /* lagging transactions skipped */
 } TState;
 
 #define INVALID_THREAD		((pthread_t) 0)
@@ -250,6 +258,7 @@ typedef struct
 	int64		sqlats;
 	int64		throttle_lag;
 	int64		throttle_lag_max;
+	int64		throttle_latency_skipped;
 } TResult;
 
 /*
@@ -367,6 +376,8 @@ usage(void)
 		   "  -f, --file=FILENAME      read transaction script from FILENAME\n"
 		   "  -j, --jobs=NUM           number of threads (default: 1)\n"
 		   "  -l, --log                write transaction times to log file\n"
+		   "  -L, --limit=NUM          under throttling (--rate), skip transactions that fall\n"
+		   "                           more than NUM ms behind schedule (default: do not skip)\n"
 		   "  -M, --protocol=simple|extended|prepared\n"
 		   "                           protocol for submitting queries (default: simple)\n"
 		   "  -n, --no-vacuum          do not run VACUUM before tests\n"
@@ -1016,6 +1027,24 @@ top:
 
 		thread->throttle_trigger += wait;
 
+		if (throttle_latency_limit)
+		{
+			instr_time	now;
+			int64		now_us;
+			INSTR_TIME_SET_CURRENT(now);
+			now_us = INSTR_TIME_GET_MICROSEC(now);
+			while (thread->throttle_trigger < now_us - throttle_latency_limit)
+			{
+				/* if too far behind, this slot is skipped, and we
+				 * iterate till the next nearly on time slot.
+				 */
+				int64 wait = (int64) (throttle_delay *
+					1.00055271703 * -log(getrand(thread, 1, 10000) / 10000.0));
+				thread->throttle_trigger += wait;
+				thread->throttle_latency_skipped++;
+			}
+		}
+
 		st->until = thread->throttle_trigger;
 		st->sleeping = 1;
 		st->throttling = true;
@@ -2351,7 +2380,8 @@ printResults(int ttype, int64 normal_xacts, int nclients,
 			 TState *threads, int nthreads,
 			 instr_time total_time, instr_time conn_total_time,
 			 int64 total_latencies, int64 total_sqlats,
-			 int64 throttle_lag, int64 throttle_lag_max)
+			 int64 throttle_lag, int64 throttle_lag_max,
+			 int64 throttle_latency_skipped)
 {
 	double		time_include,
 tps_include,
@@ -2417,6 +2447,10 @@ printResults(int ttype, int64 normal_xacts, int nclients,
 		 */
 		printf("rate limit schedule lag: avg %.3f (max %.3f) ms\n",
 			   0.001 * throttle_lag / normal_xacts, 0.001 * throttle_lag_max);
+		if (throttle_latency_limit)
+			printf("number of skipped transactions: " INT64_FORMAT " (%.3f %%)\n",
+				   throttle_latency_skipped,
+				   100.0 * throttle_latency_skipped / (throttle_latency_skipped + normal_xacts));
 	}
 
 	printf("tps = %f (including connections establishing)\n", tps_include);
@@ -2505,6 +2539,7 @@ main(int argc, char **argv)
 		{"sampling-rate", required_argument, NULL, 4},
 		{"aggregate-interval", required_argument, NULL, 5},
 		{"rate", required_argument, NULL, 'R'},
+		{"limit", required_argument, NULL, 'L'},
 		{NULL, 0, NULL, 0}
 	};
 
@@ -2534,6 +2569,7 @@ main(int argc, char **argv)
 	int64		total_sqlats = 0;
 	int64		throttle_lag = 0;
 	int64		throttle_lag_max = 0;
+	int64		throttle_latency_skipped = 0;
 
 	int			i;
 
@@ -2775,6 +2811,18 @@ main(int argc, char **argv)
 	throttle_delay = (int64) (1000000.0 / throttle_value);
 }
 break;
+			case 'L':
+				{
+					double		limit_ms = atof(optarg);
+					if (limit_ms <= 0.0)
+					{
+						fprintf(stderr, "invalid latency limit: %s\n", optarg);
+						exit(1);
+					}
+					benchmarking_option_set = true;
+					throttle_latency_limit = (int64) (limit_ms * 1000);
+				}
+				break;
 			case 0:
 /* This covers long options which take no argument. */
 if (foreign_keys || unlogged_tables)
@@ -2895,6 +2943,12 @@ main(int argc, char **argv)
 		exit(1);
 	}
 
+	if (throttle_latency_limit != 0 && throttle_delay == 0)
+	{
+		fprintf(stderr, "latency limit (-L) 

Re: [HACKERS] Parallel Sequence Scan doubts

2014-08-24 Thread Haribabu Kommi
On Sun, Aug 24, 2014 at 12:34 PM, Craig Ringer cr...@2ndquadrant.com wrote:
 On 08/24/2014 09:40 AM, Haribabu Kommi wrote:

 Any suggestions?

 Another point I didn't raise first time around, but that's IMO quite
 significant, is that you haven't addressed why this approach to fully
 parallel seqscans is useful and solves real problems in effective ways.

 It might seem obvious - of course they're useful. But I see two things
 they'd address:

 - CPU-limited sequential scans, where expensive predicates are filtering
 the scan; and

Yes, we are mainly targeting CPU-limited sequential scans. For that reason
I want the workers to handle the predicates as well, not just read the
tuples from disk.

 - I/O limited sequential scans, where the predicates already execute
 fast enough on one CPU, so most time is spent waiting for more disk I/O.

 The problem I see with your design is that it's not going to be useful
 for a large class of CPU-limited scans where the predicate isn't
 composed entirely of immutable functions and operators. Especially since
 immutable-only predicates are the best candidates for expression indexes
 anyway.

 While it'd likely be useful for I/O limited scans, it's going to
 increase contention on shared_buffers locking and page management. More
 importantly, is it the most efficient way to solve the problem with I/O
 limited scans?

 I would seriously suggest looking at first adding support for
 asynchronous I/O across ranges of extents during a sequential scan. You
 might not need multiple worker backends at all.

 I'm sure using async I/O to implement effective_io_concurrency for
 seqscans has been discussed and explored before, so again I think
 some time in the list archives might make sense.

Thanks for your inputs. I will check it.

 I don't know if it makes sense to do something as complex as parallel
 multi-process seqscans without having a path forward for supporting
 non-immutable functions - probably with fmgr API enhancements,
 additional function labels (PARALLEL), etc, depending on what you find
 is needed.

Thanks for your inputs.
I will look for a solution that extends the support to non-immutable
functions as well.

 3. In the executor Init phase, Try to copy the necessary data required
 by the workers and start the workers.

 Copy how?

 Back-ends can only communicate with each other over shared memory,
 signals, and using sockets.

 Sorry for not being clear; I meant copying those data structures into
 dynamic shared memory only. From there the workers can access them.

 That'll probably work with read-only data, but it's not viable for
 read/write data unless you use a big lock to protect it, in which case
 you lose the parallelism you want to achieve.

As of now I am thinking of sharing only read-only data with the workers.
If read/write data needs to be shared, then we may need another approach
to handle that.

 You'd have to classify what may be modified during scan execution
 carefully and determine if you need to feed any of the resulting
 modifications back to the original backend - and how to merge
 modifications by multiple workers, if it's even possible.

 That's going to involve a detailed structure-by-structure analysis and
 seems likely to be error prone and buggy.

Thanks for your inputs. I will check it properly.

 I think you should probably talk to Robert Haas about what he's been
 doing over the last couple of years on parallel query.

Sure, I will check with him.

 4. In the executor run phase, just get the tuples which are sent by
 the workers and process them further in the plan node execution.

 Again, how do you propose to copy these back to the main bgworker?

 With the help of message queues that are created in dynamic shared memory,
 the workers can send the data to the queue. On the other side the main
 backend receives the tuples from the queue.

 OK, so you plan to implement shmem queues.

 That'd be a useful starting point, as it'd be something that would be
 useful in its own right.

Shmem queues are already possible with dynamic shared memory;
I just want to use them here.
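
A minimal sketch of the idea (names and the queue size are made up, error
handling and the leader-side attach/receive are omitted; this assumes the
9.4 shm_mq API, it is not code from any posted patch):

#include "postgres.h"
#include "access/htup.h"
#include "storage/dsm.h"
#include "storage/proc.h"
#include "storage/shm_mq.h"

#define TUPLE_QUEUE_SIZE	65536	/* arbitrary for this sketch */

/* Worker side: lay a queue out at the start of the DSM segment. */
static shm_mq_handle *
setup_tuple_queue(dsm_segment *seg)
{
	shm_mq	   *mq;

	mq = shm_mq_create(dsm_segment_address(seg), TUPLE_QUEUE_SIZE);
	shm_mq_set_sender(mq, MyProc);	/* the leader sets the receiver */

	return shm_mq_attach(mq, seg, NULL);
}

/* Ship one tuple; the leader rebuilds a HeapTuple from the raw bytes. */
static void
send_tuple(shm_mq_handle *mqh, HeapTuple tuple)
{
	(void) shm_mq_send(mqh, tuple->t_len, tuple->t_data, false);
}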

 You'd have to be able to handle individual values that're larger than the ring
 buffer or whatever you're using for transfers, in case you're dealing
 with already-detoasted tuples or in-memory tuples.

 Again, chatting with Robert and others who've worked on dynamic shmem,
 parallel query, etc would be wise here.

 Yes, you are correct. For that reason I am thinking of supporting only
 functions that depend solely on input variables and do not modify any
 global data.

 You'll want to be careful with that. Nothing stops an immutable function
 referencing a cache in a C global that it initializes once and then
 treats as read only, for example.

 I suspect you'll need a per-function whitelist. I'd love to be wrong.

Yes, we need per-function details. Once we have a better solution for
handling non-immutable functions as well, these may no longer be required.



Re: [HACKERS] implement subject alternative names support for SSL connections

2014-08-24 Thread Alexey Klyukin
On Wed, Aug 20, 2014 at 11:53 AM, Heikki Linnakangas 
hlinnakan...@vmware.com wrote:

 On 07/25/2014 07:10 PM, Alexey Klyukin wrote:

 Greetings,

 I'd like to propose a patch for checking subject alternative names entry
 in
 the SSL certificate for DNS names during SSL authentication.


 Thanks! I just ran into this missing feature last week, while working on
 my SSL test suite. So +1 for having the feature.

 This patch needs to be rebased over current master branch, thanks to my
 refactoring that moved all OpenSSL-specific stuff to be-secure-openssl.c.


The patch is rebased against fe-secure-openssl.c (that's where
verify_peer_name_matches_certificate appeared in the master branch). I've
changed the condition in the for loop to be less confusing (thanks to
comments from Magnus and Tom), making an explicit break once a match is
detected.

Note that it generates a lot of OpenSSL-related warnings on my system (66
total) with clang, complaining about
"$X is deprecated: first deprecated in OS X 10.7"
[-Wdeprecated-declarations], but it does so for most other SSL functions,
so I don't think it's a problem introduced by this patch.

Sincerely,
Alexey.
diff --git a/src/interfaces/libpq/fe-secure-openssl.c 
b/src/interfaces/libpq/fe-secure-openssl.c
new file mode 100644
index f950fc3..b4f6bc9
*** a/src/interfaces/libpq/fe-secure-openssl.c
--- b/src/interfaces/libpq/fe-secure-openssl.c
***
*** 60,65 
--- 60,66 
  #ifdef USE_SSL_ENGINE
  #include <openssl/engine.h>
  #endif
+ #include <openssl/x509v3.h>
  
  static bool verify_peer_name_matches_certificate(PGconn *);
  static int	verify_cb(int ok, X509_STORE_CTX *ctx);
*** wildcard_certificate_match(const char *p
*** 473,479 
  
  
  /*
!  *Verify that common name resolves to peer.
   */
  static bool
  verify_peer_name_matches_certificate(PGconn *conn)
--- 474,480 
  
  
  /*
!  *Verify that common name or any of the alternative dNSNames resolves to 
peer.
   */
  static bool
  verify_peer_name_matches_certificate(PGconn *conn)
*** verify_peer_name_matches_certificate(PGc
*** 492,499 
  
/*
 * Extract the common name from the certificate.
-*
-* XXX: Should support alternate names here
 */
/* First find out the name's length and allocate a buffer for it. */
 	len = X509_NAME_get_text_by_NID(X509_get_subject_name(conn->peer),
--- 493,498 
*** verify_peer_name_matches_certificate(PGc
*** 556,565 
result = true;
else
{
 			printfPQExpBuffer(&conn->errorMessage,
!							  libpq_gettext("server common name \"%s\" does not match host name \"%s\"\n"),
 							  peer_cn, conn->pghost);
-			result = false;
}
}
  
--- 555,627 
result = true;
else
{
+			int			i;
+			int			alt_names_total;
+			STACK_OF(GENERAL_NAME) *alt_names;
+ 
+			/* Get the list and the total number of alternative names. */
+			alt_names = (STACK_OF(GENERAL_NAME) *) X509_get_ext_d2i(conn->peer, NID_subject_alt_name, NULL, NULL);
+			if (alt_names != NULL)
+				alt_names_total = sk_GENERAL_NAME_num(alt_names);
+			else
+				alt_names_total = 0;
+ 
+			result = false;
+ 
+			/*
+			 * Compare the alternative names dnsNames identifies against
+			 * the originally given hostname.
+			 */
+			for (i = 0; i < alt_names_total; i++)
+			{
+				const GENERAL_NAME *name = sk_GENERAL_NAME_value(alt_names, i);
+				if (name->type == GEN_DNS)
+				{
+					int			reported_len;
+					int			actual_len;
+					char	   *dns_namedata,
+							   *dns_name;
+ 
+					reported_len = ASN1_STRING_length(name->d.dNSName);
+					/* GEN_DNS can be only IA5String, equivalent to US ASCII */
+					dns_namedata = (char *) ASN1_STRING_data(name->d.dNSName);
+ 
+					dns_name = malloc(reported_len + 1);
+					memcpy(dns_name, dns_namedata, reported_len);
+					dns_name[reported_len] = '\0';
+ 
+ 

Re: [HACKERS] failures on barnacle (CLOBBER_CACHE_RECURSIVELY) because of memory leaks

2014-08-24 Thread Tom Lane
Tomas Vondra t...@fuzzy.cz writes:
 I stopped the already running test on addax and started the test on
 barnacle again. Let's see in a few days/weeks/months what is the result.

 It seems to be running much faster (probably after removing the
 randomization), and apparently it passed the create_view tests without
 crashing, but then crashed at 'constraints' (again, because of OOM).

I poked into the constraints test and soon found another leak just like
the other one :-(, which I've now fixed.  I lack the patience to let
constraints run to conclusion with C_C_R on --- I let it run for about
20 hours and it still was only maybe 1/4th done --- but it got further
than it did in your report and there was zero sign of leakage.

regards, tom lane




Re: [HACKERS] failures on barnacle (CLOBBER_CACHE_RECURSIVELY) because of memory leaks

2014-08-24 Thread Tomas Vondra
On 24.8.2014 18:01, Tom Lane wrote:
 Tomas Vondra t...@fuzzy.cz writes:
 I stopped the already running test on addax and started the test on
 barnacle again. Let's see in a few days/weeks/months what is the result.
 
 It seems to be running much faster (probably after removing the
 randomization), and apparently it passed the create_view tests without
 crashing, but then crashed at 'constraints' (again, because of OOM).
 
 I poked into the constraints test and soon found another leak just like
 the other one :-(, which I've now fixed.  I lack the patience to let
 constraints run to conclusion with C_C_R on --- I let it run for about
 20 hours and it still was only maybe 1/4th done --- but it got further
 than it did in your report and there was zero sign of leakage.

OK. There was an outage in the facility hosting this machine, and it came
back up just a few hours ago. So I killed the current run, and it'll
start chewing on this new commit in a few minutes.

Regarding those leaks we've detected so far - is it the kind of leaks
that can happen only in testing with those specific flags, or is it
something that can happen in production too? (Assuming no one is running
with CLOBBER_CACHE_RECURSIVELY in production, of course ;-) That is,
does it seem worth the effort running those tests / fixing those leaks?

regards
Tomas




Re: [HACKERS] failures on barnacle (CLOBBER_CACHE_RECURSIVELY) because of memory leaks

2014-08-24 Thread Tom Lane
Tomas Vondra t...@fuzzy.cz writes:
 Regarding those leaks we've detected so far - is it the kind of leaks
 that can happen only in testing with those specific flags, or is it
 something that can happen in production too? (Assuming no one is running
 with CLOBBER_CACHE_RECURSIVELY in production, of course ;-) That is,
 does it seem worth the effort running those tests / fixing those leaks?

I believe most or all of these leaks were understood and intentionally
ignored in the original coding, on the grounds that they were intraquery
leaks and no real-world situation would ever cause so many cache reloads
in a single query that the leakage would amount to anything problematic.
I think that reasoning is still valid for production usage.  It seems
worth fixing the leaks in HEAD so that we can get through the regression
tests on barnacle and see if anything of greater significance turns up ---
but if this is all we find, it might not have been worth the trouble.

regards, tom lane




Re: [HACKERS] failures on barnacle (CLOBBER_CACHE_RECURSIVELY) because of memory leaks

2014-08-24 Thread Tomas Vondra
On 24.8.2014 18:28, Tom Lane wrote:
 Tomas Vondra t...@fuzzy.cz writes:
 Regarding those leaks we've detected so far - is it the kind of leaks
 that can happen only in testing with those specific flags, or is it
 something that can happen in production too? (Assuming no one is running
 with CLOBBER_CACHE_RECURSIVELY in production, of course ;-) That is,
 does it seem worth the effort running those tests / fixing those leaks?
 
 I believe most or all of these leaks were understood and intentionally
 ignored in the original coding, on the grounds that they were intraquery
 leaks and no real-world situation would ever cause so many cache reloads
 in a single query that the leakage would amount to anything problematic.
 I think that reasoning is still valid for production usage.  It seems
 worth fixing the leaks in HEAD so that we can get through the regression
 tests on barnacle and see if anything of greater significance turns up ---
 but if this is all we find, it might not have been worth the trouble.

OK. Some time ago we got a report (on the czech mailing list) with this:

TopMemoryContext: 1375320 total in 168 blocks; 6472 free (18 chunks);
1368848 used
  ...
  CacheMemoryContext: 232883248 total in 5251 blocks; 5644000 free (2
chunks); 227239248 used

Apparently they're using some sort of persistent connections, and
there's ~8000 tables in that particular database, which eventually leads
to OOM for them. Could this be related?

Anyway, let's leave the tests running - either we find something
interesting or not. I think it's worth it.

regards
Tomas




Re: [HACKERS] jsonb format is pessimal for toast compression

2014-08-24 Thread Arthur Silva
On Thu, Aug 21, 2014 at 6:20 PM, Josh Berkus j...@agliodbs.com wrote:

 On 08/20/2014 03:42 PM, Arthur Silva wrote:
  What data are you using right now Josh?

 The same data as upthread.

 Can you test the three patches (9.4 head, 9.4 with Tom's cleanup of
 Heikki's patch, and 9.4 with Tom's latest lengths-only) on your workload?

 I'm concerned that my workload is unusual and don't want us to make this
 decision based entirely on it.

 --
 Josh Berkus
 PostgreSQL Experts Inc.
 http://pgexperts.com



Here's my test results so far with the github archive data.

It's important to keep in mind that the PushEvent event objects that I use
in the queries only contain a small number of keys (8 to be precise), so
these tests don't really stress the changed code.

Anyway, in this dataset (with the small objects) using the all-lengths
patch provides small compression savings, but the overhead is minimal.



Test data: 610MB of Json -- 341969 items

Index size (jsonb_ops): 331MB

Test query 1: SELECT data->'url', data->'actor' FROM t_json WHERE data @>
'{"type": "PushEvent"}'
Test query 1 items: 169732

Test query 2: SELECT data FROM t_json WHERE data @> '{"type": "PushEvent"}'
Test query 2 items:


HEAD (aka, all offsets) EXTENDED
Size: 374MB
Toast Size: 145MB

Test query 1 runtime: 680ms
Test query 2 runtime: 405ms

HEAD (aka, all offsets) EXTERNAL
Size: 366MB
Toast Size: 333MB

Test query 1 runtime: 505ms
Test query 2 runtime: 350ms

All Lengths (Tom Lane patch) EXTENDED
Size: 379MB
Toast Size: 108MB

Test query 1 runtime: 720ms
Test query 2 runtime: 420ms

All Lengths (Tom Lane patch) EXTERNAL
Size: 366MB
Toast Size: 333MB

Test query 1 runtime: 525ms
Test query 2 runtime: 355ms


--
Arthur Silva


Re: [HACKERS] SKIP LOCKED DATA (work in progress)

2014-08-24 Thread Thomas Munro
On 22 August 2014 23:02, Alvaro Herrera alvhe...@2ndquadrant.com wrote:
 heap_lock_tuple() has the following comment on top:

  * In the failure cases, the routine fills *hufd with the tuple's t_ctid,
  * t_xmax (resolving a possible MultiXact, if necessary), and t_cmax
  * (the last only for HeapTupleSelfUpdated, since we
  * cannot obtain cmax from a combocid generated by another transaction).
  * See comments for struct HeapUpdateFailureData for additional info.

 With the patch as submitted, this struct is not filled in the
 HeapTupleWouldBlock case.  I'm not sure this is okay, though I admit the
 only caller that passes LockWaitSkip doesn't care, so maybe we could
 just ignore the issue (after properly modifying the comment).  It seems
 easy to add a LockBuffer() and goto failed rather than a return; that
 would make that failure case conform to HeapTupleUpdated and
 HeapTupleSelfUpdated.  OTOH it might be pointless to worry about what
 would be essentially dead code currently ...

Thanks Alvaro.

Forgive me if I have misunderstood but it looks like your incremental
patch included a couple of unrelated changes, namely
s/0/InvalidCommandId/ and a reversion of ConditionalMultiXactIdWait.
Undoing those gave me skip-locked-v12-b.patch, attached for reference,
and I've included those changes in a new full patch
skip-locked-v13.patch (also rebased).

+1 for the change from if-then-else to switch statements.

I was less sure about the 'goto failed' change, but I couldn't measure
any change in concurrent throughput in my simple benchmark, so I guess
that extra buffer lock/unlock doesn't matter and I can see why you
prefer that control flow.

 Did you consider heap_lock_updated_tuple?  A rationale for saying it
 doesn't need to pay attention to the wait policy is: if you're trying to
 lock-skip-locked an updated tuple, then you either skip it because its
 updater is running, or you return it because it's no longer running; and
 if you return it, it is not possible for the updater to be locking the
 updated version.  However, what if there's a third transaction that
 locked the updated version?  It might be difficult to hit this case or
 construct an isolationtester spec file though, since there's a narrow
 window you need to race to.

Hmm.  I will look into this, thanks.

Best regards,
Thomas Munro
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 31073bc..464a73c 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -4222,22 +4222,27 @@ l3:
 		 */
 		if (!have_tuple_lock)
 		{
-			if (wait_policy == LockWaitBlock)
+			switch (wait_policy)
 			{
-				LockTupleTuplock(relation, tid, mode);
-			}
-			else if (wait_policy == LockWaitError)
-			{
-				if (!ConditionalLockTupleTuplock(relation, tid, mode))
-					ereport(ERROR,
-							(errcode(ERRCODE_LOCK_NOT_AVAILABLE),
-					errmsg("could not obtain lock on row in relation \"%s\"",
-						   RelationGetRelationName(relation))));
-			}
-			else /* wait_policy == LockWaitSkip */
-			{
-				if (!ConditionalLockTupleTuplock(relation, tid, mode))
-					return HeapTupleWouldBlock;
+				case LockWaitBlock:
+					LockTupleTuplock(relation, tid, mode);
+					break;
+				case LockWaitSkip:
+					if (!ConditionalLockTupleTuplock(relation, tid, mode))
+					{
+						result = HeapTupleWouldBlock;
+						/* recovery code expects to have buffer lock held */
+						LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
+						goto failed;
+					}
+					break;
+				case LockWaitError:
+					if (!ConditionalLockTupleTuplock(relation, tid, mode))
+						ereport(ERROR,
+								(errcode(ERRCODE_LOCK_NOT_AVAILABLE),
+								 errmsg("could not obtain lock on row in relation \"%s\"",
+										RelationGetRelationName(relation))));
+					break;
 			}
 			have_tuple_lock = true;
 		}
@@ -4441,34 +4446,36 @@ l3:
 				if (status >= MultiXactStatusNoKeyUpdate)
 					elog(ERROR, "invalid lock mode in heap_lock_tuple");
 
-				/* wait for multixact to end */
-				if (wait_policy == LockWaitBlock)
+				/* wait for multixact to end, or die trying  */
+				switch (wait_policy)
 				{
-					MultiXactIdWait((MultiXactId) xwait, status, infomask,
-									relation, &tuple->t_data->t_ctid,
-									XLTW_Lock, NULL);
-				}
-				else if (wait_policy == LockWaitError)
-				{
-					if (!ConditionalMultiXactIdWait((MultiXactId) xwait,
-												  status, infomask, relation,
-													&tuple->t_data->t_ctid,
-													XLTW_Lock, NULL))
-						ereport(ERROR,
-								(errcode(ERRCODE_LOCK_NOT_AVAILABLE),
-								 errmsg("could not obtain lock on row in relation \"%s\"",
-										RelationGetRelationName(relation))));
-				}
-				else /* wait_policy == LockWaitSkip */
-				{
-					if (!ConditionalMultiXactIdWait((MultiXactId) xwait,
-													status, infomask, relation,
-													&tuple->t_data->t_ctid,
-													XLTW_Lock, NULL))
-					{
-						UnlockTupleTuplock(relation, tid, mode);
-						return HeapTupleWouldBlock;
-					}
+					case 

Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch

2014-08-24 Thread johnlumby

Thanks for the replies and thoughts.

On 08/19/14 18:27, Heikki Linnakangas wrote:

On 08/20/2014 12:17 AM, John Lumby wrote:
I am attaching a new version of the patch for consideration in the 
current commit fest.


Thanks for working on this!

Relative to the one I submitted on 25 June in
bay175-w412ff89303686022a9f16aa3...@phx.gbl
the method for handling aio completion using sigevent has been re-written
to use signals exclusively rather than a composite of signals and LWlocks,
and this has fixed the problem I mentioned before with the LWlock method.


ISTM the patch is still allocating stuff in shared memory that really 
doesn't belong there. Namely, the BufferAiocb structs. Or at least 
parts of it; there's also a waiter queue there which probably needs to 
live in shared memory, but the rest of it does not.


Actually, the reason the BufferAiocb (the postgresql block corresponding
to the aio's aiocb) must be located in shared memory is that, as you said,
it acts as the anchor for the waiter list.

See further comment below.



At least BufAWaitAioCompletion is still calling aio_error() on an AIO 
request that might've been initiated by another backend. That's not OK.


Yes, you are right, and I agree with this one -
I will add an aio_error_return_code field in the BufferAiocb,
and only the originator will set it from the real aiocb.



Please write the patch without atomic CAS operation. Just use a spinlock. 


Umm, this is a new criticism I think. I use CAS for things other than
locking, such as add/remove from a shared queue. I suppose a spinlock on
the entire queue could be used equivalently, but with more code (extra
confusion) and worse performance (coarser serialization). What is your
objection to using gcc's atomic ops? Portability?
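
For example, the sort of add/remove I mean looks roughly like this
(illustrative only, not the patch's actual code, and ignoring the ABA and
memory-ordering issues a real implementation must handle):

typedef struct BufferAiocb BufferAiocb;
struct BufferAiocb
{
	BufferAiocb *next;
	/* ... aiocb, waiter list, etc. ... */
};

/* Push an item onto a shared free list without taking a lock. */
static void
push_free(BufferAiocb **head, BufferAiocb *item)
{
	BufferAiocb *old;

	do
	{
		old = *head;
		item->next = old;
	} while (!__sync_bool_compare_and_swap(head, old, item));
}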




There's a patch in the commitfest to add support for that,


Sorry, support for what? There are already spinlocks in postgresql;
do you mean some new kind? Please point me at it with a hackers msgid
or something.


but it's not committed yet, and all those 
USE_AIO_ATOMIC_BUILTIN_COMP_SWAP ifdefs make the patch more difficult 
to read. Same with all the other #ifdefs; please just pick a method 
that works.


Ok, yes, the ifdefs are unpleasant. I will do something about that.
Ideally they would be entirely confined to header files, with only macro
functions in C files - maybe I can do that. And eventually, when the dust
has settled, eliminate obsolete ifdef blocks altogether.

Also, please split prefetching of regular index scans into a separate 
patch. It's orthogonal to doing async I/O;


actually not completely orthogonal, see next

we could prefetch regular index scans with posix_fadvise too, and AIO 
should be useful for prefetching other stuff.


On 08/19/14 19:10, Claudio Freire wrote:

On Tue, Aug 19, 2014 at 7:27 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:

Also, please split prefetching of regular index scans into a separate patch. ...

That patch already happened on the list, and it wasn't a win in many
cases. I'm not sure it should be proposed independently of this one.
Maybe a separate patch, but it should be considered dependent on this.

I don't have an archive link at hand atm, but I could produce one later.
Several people have asked to split this patch into several smaller ones and
I have thought about it. It would introduce some awkward dependencies.
E.g. splitting the scanner code (index, relation-heap) into a separate patch
from the aio code would mean that the scanner patch becomes dependent
on the aio patch. They are not quite orthogonal.

The reason is that the scanners call a new function, DiscardBuffer(blockid),
when aio is in use. We can get around it by providing a stub for that
function in the scanner patch, but then there is some risk of someone
getting the wrong version of that function in their build. It just adds
yet more complexity and breakage opportunities.



- Heikki


One further comment concerning these BufferAiocb and aiocb control blocks
being in shared memory :

I explained above that the BufferAiocb must be in shared memory for 
wait/post.

The aiocb does not necessarily have to be in shared memory, but since there
is a one-to-one relationship between BufferAiocb and aiocb, it makes the
code much simpler, and the operation much more efficient, if the aiocb is
embedded in the BufferAiocb as I have it now. E.g. if the aiocb is in
private-process memory, then an additional allocation scheme is needed
(fixed number? palloc()'ing extra ones as needed? ...), which adds
complexity, and probably some inefficiency, since a shared pool is usually
more efficient (allows a higher maximum per process etc), and there would
have to be some pointer dereferencing from BufferAiocb to aiocb, adding
some (small) overhead.


I understood your objection to use of shared memory as being that you 
don't want
a non-originator to access the originator's aiocb using 

[HACKERS] Code bug or doc bug?

2014-08-24 Thread Josh Berkus
Folks,

Quoth our docs
(http://www.postgresql.org/docs/9.3/static/sql-alterdatabase.html):

The fourth form changes the default tablespace of the database. Only
the database owner or a superuser can do this; you must also have create
privilege for the new tablespace. This command physically moves any
tables or indexes in the database's old default tablespace to the new
tablespace. Note that tables and indexes in non-default tablespaces are
not affected.

Yet:

jberkus=# alter database phc set tablespace ssd;
ERROR:  some relations of database phc are already in tablespace ssd
HINT:  You must move them back to the database's default tablespace
before using this command.

Aside from being a stupid limitation (I need to copy the tables back to
the old tablespace so that I can recopy them to the new one?), the above
seems to be in direct contradiction to the docs.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] SKIP LOCKED DATA (work in progress)

2014-08-24 Thread Thomas Munro
On 24 August 2014 22:04, Thomas Munro mu...@ip9.org wrote:
 On 22 August 2014 23:02, Alvaro Herrera alvhe...@2ndquadrant.com wrote:
 Did you consider heap_lock_updated_tuple?  A rationale for saying it
 doesn't need to pay attention to the wait policy is: if you're trying to
 lock-skip-locked an updated tuple, then you either skip it because its
 updater is running, or you return it because it's no longer running; and
 if you return it, it is not possible for the updater to be locking the
 updated version.  However, what if there's a third transaction that
 locked the updated version?  It might be difficult to hit this case or
 construct an isolationtester spec file though, since there's a narrow
 window you need to race to.

 Hmm.  I will look into this, thanks.

While trying to produce the heap_lock_updated_tuple_rec case you
describe (so far unsuccessfully), I discovered I could make SELECT ...
FOR UPDATE NOWAIT block indefinitely on unpatched 9.3 in a different
code path after heap_lock_tuple returns: in another session, UPDATE,
COMMIT, then UPDATE, all after the first session has taken its
snapshot but before it tries to lock a given row.  The code in
EvalPlanQualFetch (reached from the HeapTupleUpdated case in
ExecLockRow) finishes up waiting for the uncommitted transaction.

I think I see how to teach EvalPlanQualFetch how to handle wait
policies: for NOWAIT it should ereport (fixing a pre-existing bug
(?)), and I guess it should handle SKIP LOCKED by returning NULL,
similarly to the way it handles deleted rows, and of course in all
cases passing the wait policy forward to heap_lock_tuple, which it
eventually calls.
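
Roughly, the idea is something like this (a sketch only, with placeholder
function and variable names for what EvalPlanQualFetch has in scope, using
the LockWaitPolicy values from the patch; not a finished change):

#include "postgres.h"
#include "storage/lmgr.h"
#include "utils/rel.h"

/* Returns false if the row should be skipped under SKIP LOCKED. */
static bool
wait_for_updater(TransactionId xwait, Relation relation,
				 ItemPointer tid, LockWaitPolicy wait_policy)
{
	switch (wait_policy)
	{
		case LockWaitBlock:
			XactLockTableWait(xwait, relation, tid, XLTW_FetchUpdated);
			break;
		case LockWaitSkip:
			if (!ConditionalXactLockTableWait(xwait))
				return false;	/* caller returns NULL, as for deleted rows */
			break;
		case LockWaitError:
			if (!ConditionalXactLockTableWait(xwait))
				ereport(ERROR,
						(errcode(ERRCODE_LOCK_NOT_AVAILABLE),
						 errmsg("could not obtain lock on row in relation \"%s\"",
								RelationGetRelationName(relation))));
			break;
	}
	return true;
}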

Still looking at heap_lock_updated_tuple.

The difficulty of course will be testing all these racy cases reproducibly...

Best regards,
Thomas Munro




Re: [HACKERS] SKIP LOCKED DATA (work in progress)

2014-08-24 Thread Craig Ringer
On 08/25/2014 09:44 AM, Thomas Munro wrote:
 On 24 August 2014 22:04, Thomas Munro mu...@ip9.org wrote:
 On 22 August 2014 23:02, Alvaro Herrera alvhe...@2ndquadrant.com wrote:
 Did you consider heap_lock_updated_tuple?  A rationale for saying it
 doesn't need to pay attention to the wait policy is: if you're trying to
 lock-skip-locked an updated tuple, then you either skip it because its
 updater is running, or you return it because it's no longer running; and
 if you return it, it is not possible for the updater to be locking the
 updated version.  However, what if there's a third transaction that
 locked the updated version?  It might be difficult to hit this case or
 construct an isolationtester spec file though, since there's a narrow
 window you need to race to.

 Hmm.  I will look into this, thanks.
 
 While trying to produce the heap_lock_updated_tuple_rec case you
 describe (so far unsuccessfully), I discovered I could make SELECT ...
 FOR UPDATE NOWAIT block indefinitely on unpatched 9.3 in a different
 code path after heap_lock_tuple returns: in another session, UPDATE,
 COMMIT, then UPDATE, all after the first session has taken its
 snapshot but before it tries to lock a given row.  The code in
 EvalPlanQualFetch (reached from the HeapTupleUpdated case in
 ExecLockRow) finishes up waiting for the uncommitted transaction.

I think that's the issue Andres and I patched for 9.3, but I don't know
if it got committed.

I'll need to have a look. A search in the archives for heap_lock_tuple
and nowait might be informative.

 The difficulty of course will be testing all these racy cases reproducibly...

Yep, especially as isolationtester can only really work at the statement
level, and can only handle one blocked connection at a time.

It's possible a helper extension could be used - set up some locks in
shmem, register two sessions for different test roles within a given
test to install appropriate hooks, wait until they're both blocked on
the locks the hooks acquire, then release the locks.

That might land up with hook points scattered everywhere, but they could
be some pretty minimal macros at least.

IMO that's a separate project, though. For this kind of testing I've
tended to just set conditional breakpoints in gdb, wait until they trap,
then once everything's lined up release the breakpoints in both sessions
as simultaneously as possible. Not perfect, but has tended to work.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] SKIP LOCKED DATA (work in progress)

2014-08-24 Thread Alvaro Herrera
Thomas Munro wrote:

 While trying to produce the heap_lock_updated_tuple_rec case you
 describe (so far unsuccessfully), I discovered I could make SELECT ...
 FOR UPDATE NOWAIT block indefinitely on unpatched 9.3 in a different
 code path after heap_lock_tuple returns: in another session, UPDATE,
 COMMIT, then UPDATE, all after the first session has taken its
 snapshot but before it tries to lock a given row.  The code in
 EvalPlanQualFetch (reached from the HeapTupleUpdated case in
 ExecLockRow) finishes up waiting for the uncommitted transaction.

Interesting -- perhaps this helps explain the deadlock issue reported in
http://www.postgresql.org/message-id/cakrjmhdn+ghajnwqfhsotgp+7yn27zr79m99rcajmnazt5n...@mail.gmail.com

 I think I see how to teach EvalPlanQualFetch how to handle wait
 policies: for NOWAIT it should ereport (fixing a pre-existing bug
 (?)), and I guess it should handle SKIP LOCKED by returning NULL,
 similarly to the way it handles deleted rows, and of course in all
 cases passing the wait policy forward to heap_lock_tuple, which it
 eventually calls.
 
 Still looking at heap_lock_updated_tuple.
 
 The difficulty of course will be testing all these racy cases reproducibly...

Does this help?
http://www.postgresql.org/message-id/51fb4305.3070...@2ndquadrant.com
The useful trick there is forcing a query to get its snapshot and then
go to sleep before actually doing anything, by way of an advisory lock.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] ALTER SYSTEM RESET?

2014-08-24 Thread Amit Kapila
On Wed, Jul 30, 2014 at 9:11 AM, Amit Kapila amit.kapil...@gmail.com
wrote:
 I have verified the patch and found that it works well for
 all scenarios.  A few minor suggestions:

 1.
 !values to the <filename>postgresql.auto.conf</filename> file.
 !Setting the parameter to <literal>DEFAULT</literal>, or using the
 !<command>RESET</command> variant, removes the configuration entry from

 It would be better if we wrote a separate line for RESET ALL,
 as is done for both ALTER DATABASE and ALTER ROLE in their
 respective documentation.

 2.
 ! %type <vsetstmt> generic_set set_rest set_rest_more generic_reset
 reset_rest SetResetClause FunctionSetResetClause

 Good to break it into 2 lines.

 3. I think we can add some text on top of function
 AlterSystemSetConfigFile() to explain functionality w.r.t reset all.

I have updated the patch to address the above points.

I will mark this patch as Ready For Committer, as most of the
review comments were already addressed by Vik, and I have addressed
the remaining ones in the attached patch.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


alter_system_reset.v4.patch
Description: Binary data



Re: [HACKERS] Maximum number of WAL files in the pg_xlog directory

2014-08-24 Thread Guillaume Lelarge
On 8 August 2014 09:08, Guillaume Lelarge guilla...@lelarge.info wrote:

 Hi,

 As part of our monitoring work for our customers, we stumbled upon an
issue with our customers' servers that have a wal_keep_segments setting
higher than 0.

 We have a monitoring script that checks the number of WAL files in the
pg_xlog directory, according to the setting of three parameters
(checkpoint_completion_target, checkpoint_segments, and wal_keep_segments).
We usually add a percentage to the usual formula:

 greatest(
   (2 + checkpoint_completion_target) * checkpoint_segments + 1,
   checkpoint_segments + wal_keep_segments + 1
 )

 And we have lots of alerts from the script for customers who set their
wal_keep_segments setting higher than 0.

 So we started to question this sentence of the documentation:

 There will always be at least one WAL segment file, and will normally not
be more than (2 + checkpoint_completion_target) * checkpoint_segments + 1
or checkpoint_segments + wal_keep_segments + 1 files.

 (http://www.postgresql.org/docs/9.3/static/wal-configuration.html)

 While doing some tests, it appears it would be more something like:

 wal_keep_segments + (2 + checkpoint_completion_target) *
checkpoint_segments + 1

 But after reading the source code (src/backend/access/transam/xlog.c),
the right formula seems to be:

 wal_keep_segments + 2 * checkpoint_segments + 1

 Here is how we went to this formula...

 CreateCheckPoint(..) is responsible, among other things, for deleting and
recycling old WAL files. From src/backend/access/transam/xlog.c, master
branch, line 8363:

 /*
  * Delete old log files (those no longer needed even for previous
  * checkpoint or the standbys in XLOG streaming).
  */
 if (_logSegNo)
 {
 KeepLogSeg(recptr, _logSegNo);
 _logSegNo--;
 RemoveOldXlogFiles(_logSegNo, recptr);
 }

 KeepLogSeg(...) function takes care of wal_keep_segments. From
src/backend/access/transam/xlog.c, master branch, line 8792:

 	/* compute limit for wal_keep_segments first */
 	if (wal_keep_segments > 0)
 	{
 		/* avoid underflow, don't go below 1 */
 		if (segno <= wal_keep_segments)
 			segno = 1;
 		else
 			segno = segno - wal_keep_segments;
 	}

 IOW, the segment number (segno) is decremented according to the setting
of wal_keep_segments. segno is then sent back to CreateCheckPoint(...) via
_logSegNo. The RemoveOldXlogFiles() gets this segment number so that it can
remove or recycle all files before this segment number. This function gets
the number of WAL files to recycle with the XLOGfileslop constant, which is
defined as:

 /*
  * XLOGfileslop is the maximum number of preallocated future XLOG
segments.
  * When we are done with an old XLOG segment file, we will recycle it as a
  * future XLOG segment as long as there aren't already XLOGfileslop future
  * segments; else we'll delete it.  This could be made a separate GUC
  * variable, but at present I think it's sufficient to hardwire it as
  * 2*CheckPointSegments+1.  Under normal conditions, a checkpoint will
free
  * no more than 2*CheckPointSegments log segments, and we want to recycle
all
  * of them; the +1 allows boundary cases to happen without wasting a
  * delete/create-segment cycle.
  */
 #define XLOGfileslop	(2*CheckPointSegments + 1)

 (in src/backend/access/transam/xlog.c, master branch, line 100)

 IOW, PostgreSQL will keep wal_keep_segments WAL files before the current
WAL file, and then there may be 2*CheckPointSegments + 1 recycled ones.
Hence the formula:

 wal_keep_segments + 2 * checkpoint_segments + 1

 And this is what we usually find in our customers' servers. We may find
more WAL files, depending on the write activity of the cluster, but on
average, we get this number of WAL files.
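 
 A worked example (illustrative numbers, assuming checkpoint_completion_target
 = 0.9): with checkpoint_segments = 3 and wal_keep_segments = 32, the
 documented formula gives greatest((2 + 0.9) * 3 + 1, 3 + 32 + 1) = 36
 segments, while the formula above gives 32 + 2 * 3 + 1 = 39.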

 AFAICT, the documentation is wrong about the usual number of WAL files in
the pg_xlog directory. But I may be wrong, in which case, the documentation
isn't clear enough for me, and should be fixed so that others can't
misinterpret it like I may have done.

 Any comments? did I miss something, or should we fix the documentation?

 Thanks.


Ping?


Re: [HACKERS] WIP Patch for GROUPING SETS phase 1

2014-08-24 Thread Andrew Gierth
 Robert == Robert Haas robertmh...@gmail.com writes:

 Robert I can accept ugly code, but I feel strongly that we shouldn't
 Robert accept ugly semantics.  Forcing cube to get out of the way
 Robert may not be pretty, but I think it will be much worse if we
 Robert violate the rule that quoting a keyword strips it of its
 Robert special meaning; or the rule that there are four kinds of
 Robert keywords and, if a keyword of a particular class is accepted
 Robert as an identifier in a given context, all other keywords in
 Robert that class will also be accepted as identifiers in that
 Robert context.  Violating those rules could have not-fun-at-all
 Robert consequences like needing to pass additional context
 Robert information to ruleutils.c's quote_identifier() function, or
 Robert not being able to dump and restore a query from an older
 Robert version even with --quote-all-identifiers.  Renaming the cube
 Robert type will suck for people who are using it; but it will only
 Robert have to be done once; weird stuff like the above will be with
 Robert us forever.

If you look at the latest patch post, there's a small patch in it that
does nothing but unreserve the keywords and fix ruleutils to make
deparse/parse work. The required fix to ruleutils is an example of
violating your four kinds of keywords principle, but quoting
keywords still works.

-- 
Andrew (irc:RhodiumToad)

