Re: [HACKERS] Stating the significance of Lehman Yao in the nbtree README

2014-09-09 Thread Heikki Linnakangas

On 09/07/2014 05:58 AM, Peter Geoghegan wrote:

+ Lehman and Yao don't require read locks, but assume that in-memory
+ copies of tree pages are unshared.  Postgres shares in-memory buffers
+ among backends.  As a result, we do page-level read locking on btree
+ pages in order to guarantee that no record is modified while we are
+ examining it.  This reduces concurrency but guarantees correct
+ behavior.  An advantage is that when trading in a read lock for a
+ write lock, we need not re-read the page after getting the write lock.
+ Since we're also holding a pin on the shared buffer containing the
+ page, we know that buffer still contains the page and is up-to-date.


This is the existing paragraph, just moved to different place in the README.


+ Although it could be argued that Lehman and Yao isn't followed to the
+ letter because single pages are read locked as the tree is descended,
+ this is at least necessary to support deletion, a common requirement
+ which L&Y hardly acknowledge.  Read locks also ensure that B-tree
+ pages are self-consistent (L&Y appear to assume atomic page reads and
+ writes).


This is just duplicating the existing paragraph. I don't see the point 
of this.



 Even with these read locks, following L&Y obviates the need
+ of earlier schemes to hold multiple locks concurrently when descending
+ the tree as part of servicing index scans (pessimistic lock coupling).
+ The addition of right-links at all levels, as well as the addition of
+ a page high key allows detection and dynamic recovery from
+ concurrent page splits (that is, splits between unlocking an internal
+ page, and subsequently locking its child page during a descent).  When
+ a page is first locked (at every level of a descent servicing an index
+ scan), we consider the need to move right:  if the scankey value is
+ greater than (or sometimes greater than or equal to) the page's existing
+ highkey value, a value which serves as an upper bound for values on
+ the page generally, then it must be necessary to move the scan to the
+ right-hand page on the same level.  It's even possible that the scan
+ needs to move right more than once.  Once the other session's
+ concurrent page split finishes, a downlink will be inserted into the
+ parent, and so assuming there are no further page splits, future index
+ scans using the same scankey value will not need to move right.  L&Y
+ trees are sometimes referred to as B-link trees in the literature.


This explains in a few sentences what an L&Y B-tree looks like. The
current README assumes that you're already familiar with what an L&Y tree
looks like, or that you go read the paper mentioned at the top of the 
README. It might be a good idea to expand on that, and add an 
introduction like this so that an unfamiliar reader doesn't need to read 
the L&Y paper first. Is that the purpose of this patch? Please make it
more explicit. And please make the sentences simpler - the above reads 
like a Shakespeare play. Something like:


The basic Lehman & Yao Algorithm
================================


Compared to a classic B-tree, L&Y adds a right-link pointer to each
page, to the page's right sibling. It also adds a high key to each
page, which is an upper bound on the keys that are allowed on that page.
These two additions make it possible to detect a concurrent page split,
which allows the tree to be searched without holding any read locks
(except to keep a single page from being modified while reading it).


When a search follows a downlink to a child page, it compares the page's 
high key with the search key. If the search key is greater than the high 
key, the page must've been split concurrently, and you must follow the 
right-link to find the new page containing the key range you're looking 
for. This might need to be repeated, if the page has been split more 
than once.


Differences to the Lehman & Yao algorithm
=========================================

(current "Lehman and Yao Algorithm and Insertions" section)



I think that's pretty much the same information you added, but it's in a 
separate section, with the clear purpose that it explains what an L&Y
tree looks like. You can skip over it, if you have read the paper or are 
otherwise already familiar with it. It still assumes that you're 
familiar with B-trees in general.
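
For illustration, the move-right step that both versions describe comes
down to roughly the following sketch (invented names, not the actual
nbtree code):

/*
 * Hypothetical sketch of one L&Y descent step.  "Page", "read_lock",
 * "high_key" etc. are made-up helpers for illustration only.
 */
Page
descend_one_level(Page page, Key scankey)
{
	read_lock(page);

	/* A concurrent split may have moved our key range to the right. */
	while (!is_rightmost(page) && key_gt(scankey, high_key(page)))
	{
		Page	right = right_sibling(page);

		read_unlock(page);
		read_lock(right);	/* may repeat if the page split more than once */
		page = right;
	}
	return page;		/* now covers the scankey's range */
}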


Anyway, I see that you had resurrected this in the commitfest app after
three weeks of inactivity. I'm going to mark this back to "Returned with
Feedback". Please don't resurrect it again; this patch has received more
than its fair share of attention. Instead, please help by signing up to
review a patch. The commitfest progress is glacial at the moment, and we
really need experienced reviewers like you to get closure on people's
patches.


- Heikki





Re: [HACKERS] proposal: plpgsql - Assert statement

2014-09-09 Thread Marko Tiikkaja

On 2014-09-09 07:54, Craig Ringer wrote:

On 09/05/2014 05:21 PM, Pavel Stehule wrote:


*shrug*  Doing it in SQL would probably break more stuff.  I'm trying to
contain the damage.  And arguably, this is mostly only useful in PL/PgSQL.


I've wanted assertions in SQL enough that I often write trivial wrappers
around `raise` in PL/PgSQL for use in `CASE` statements etc.


Yeah, as have I.  I've also advocated that there should be a
raise_exception(text, anyelement) function, returning anyelement, shipped
with postgres.


But this is something different; this is just a single statement which 
asserts that some expression evaluates to true.  Even if we allowed it 
to be used as a scalar expression, there's still the problem that
anyelement is commonly used to work around.



.marko




Re: [HACKERS] pgcrypto: PGP armor headers

2014-09-09 Thread Heikki Linnakangas

On 08/15/2014 11:55 AM, Marko Tiikkaja wrote:

Hi,

On 8/8/14 3:18 PM, I wrote:

Currently there's no way to generate or extract armor headers from the
PGP armored format in pgcrypto.  I've written a patch to add the
support.


Latest version of the patch here, having fixed some small coding issues.


This coding style gives me the willies:


guess_len = pgp_armor_enc_len(data_len, num_headers, keys, values);
res = palloc(VARHDRSZ + guess_len);

res_len = pgp_armor_encode((uint8 *) VARDATA(data), data_len,
						   (uint8 *) VARDATA(res),
						   num_headers, keys, values);
if (res_len > guess_len)
	ereport(ERROR,
			(errcode(ERRCODE_EXTERNAL_ROUTINE_INVOCATION_EXCEPTION),
			 errmsg("Overflow - encode estimate too small")));


That was OK before this patch, as the length calculation was simple 
enough to verify (although if I were writing it from scratch, I would've 
written it differently). But with this patch, it gets a lot more 
complicated, and I can't easily convince myself that it's correct.


pgp_armor_enc_len might be vulnerable to integer overflow. Consider 1GB 
worth of keys, 1GB worth of values, and 1GB worth of data. I'm not sure 
if you can quite make it overflow a 32-bit unsigned integer, but at 
least you can get nervously close, e.g if you use use max-sized 
key/value arrays, with a single byte in each key and value. Oh, and if 
you use a single-byte server encoding and characters that get expanded 
to multiple bytes in UTF-8, you can go higher.
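
For instance, a defensive variant of the estimate could check every
addition explicitly. A sketch of the idea, not the patch's actual code
(assumes the usual backend headers plus <limits.h>):

static unsigned
add_size_checked(unsigned a, unsigned b)
{
	/* each input can approach 1GB, so 32-bit unsigned math can wrap */
	if (a > UINT_MAX - b)
		ereport(ERROR,
				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
				 errmsg("armor length estimate overflows")));
	return a + b;
}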


So I think this (and the corresponding dearmor code too) should be 
refactored to use a StringInfo that gets enlarged as needed, instead of 
hoping to guess the size correctly beforehand. To ease review, it might
make sense to do that as a separate patch over current sources, and the
main patch on top of that.
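
Something along these lines, using the standard StringInfo helpers (a
sketch of the suggested shape, not working pgcrypto code; armored_body
and body_len are placeholders for whatever the encoder produces):

	StringInfoData buf;

	initStringInfo(&buf);
	appendStringInfoString(&buf, "-----BEGIN PGP MESSAGE-----\n");
	/* ... append each "key: value" header line the same way ... */
	appendBinaryStringInfo(&buf, (const char *) armored_body, body_len);
	appendStringInfoString(&buf, "-----END PGP MESSAGE-----\n");
	/* buf.data / buf.len now hold the result; no up-front estimate needed */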


BTW, I'm surprised that there is no function to get all the armor 
headers. You can only probe for a particular one with pgp_armor_header,
but there is no way to list them all, if you don't know what you're 
looking for.


- Heikki





Re: [HACKERS] pgcrypto: PGP armor headers

2014-09-09 Thread Marko Tiikkaja

On 9/9/14 10:54 AM, Heikki Linnakangas wrote:

So I think this (and the corresponding dearmor code too) should be
refactored to use a StringInfo that gets enlarged as needed, instead of
hoping to guess the size correctly beforehand. To ease review, it might
make sense to do that as a separate patch over current sources, and the
main patch on top of that.


Yeah, I thought the same thing while writing that code.  Thanks, I'll do 
it that way.



BTW, I'm surprised that there is no function to get all the armor
headers. You can only probe for a particular one with pgp_armor_header,
but there is no way to list them all, if you don't know what you're
looking for.


Right.  The assumption was that due to the nature of how the headers are 
used, the user would always be looking for a certain, predetermined set 
of values.  But I don't oppose adding a pgp_armor_header_keys()
function to get the set of all keys, for example.



.marko




Re: [HACKERS] Spinlocks and compiler/memory barriers

2014-09-09 Thread Andres Freund
On 2014-09-04 08:18:37 -0400, Robert Haas wrote:
 On Tue, Aug 5, 2014 at 11:55 AM, Robert Haas robertmh...@gmail.com wrote:
  On Sun, Jul 6, 2014 at 3:12 PM, Andres Freund and...@2ndquadrant.com 
  wrote:
   If you want to do that, it's fine with me.  What I would do is:
  
   - Back-patch the addition of the sparcv8+ stuff all the way.  If
   anyone's running anything older, let them complain...
   - Remove the special case for MIPS without gcc intrinsics only in
   master, leaving the back-branches broken.  If anyone cares, let them
   complain...
   - Nothing else.
 
  I've gone ahead and done the second of these things.
 
  Thanks.
 
  Andres, do you want to go take a stab at fixing the SPARC stuff?
 
  Will do, will probably take me till thursday to come up with the brain
  cycles.
 
  Ping?
 
 This has been pending for almost two months now and, at your request,
 my patch to make spinlocks act as compiler barriers is waiting behind
 it.  Can we please get this moving again soon, or can I commit that
 patch and you can fix this when you get around to it?

I finally pushed this. And once more I seriously got pissed at the poor
overall worldwide state of documentation and continuously changing
terminology around this.

Sorry for taking this long :(

Do you have a current version of your patch to make them compiler
barriers?

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] LIMIT for UPDATE and DELETE

2014-09-09 Thread Heikki Linnakangas

On 09/03/2014 06:39 PM, Robert Haas wrote:

On Wed, Sep 3, 2014 at 11:02 AM, Marko Tiikkaja ma...@joh.to wrote:

On 9/3/14 4:46 PM, Robert Haas wrote:

Making it
suck more because you don't think it's as important as your feature
is, in my opinion, not cool.


I really can't see how that would make inheritance suck *more*.  You can't
do UPDATE .. ORDER BY now, and you wouldn't be able to do it after the
patch.  Yeah, sure, perhaps people using inheritance might feel left out,
but surely that would just motivate them to work on improving it.


I think it's entirely reasonable for us to require that all new SQL
features should be required to work with or without inheritance.  If
we took the opposite position, and said that the only things that
need to work with inheritance are the ones that existed at the time
inheritance was introduced, then we'd not need to worry about it not
only for this feature but for row-level security and SKIP LOCKED and
GROUPING SETS and, going back a bit further, materialized views and
security-barrier views and LATERAL and CTEs and on and on.  Perhaps
not all of those require any special handling for inheritance
hierarchies, but some of them surely did, and if even 10% of the SQL
features that we have added since table inheritance were allowed to
opt out of supporting it, we'd have a broken and unusable feature
today.

Now some people might argue that we have that anyway, but the fact is
that a lot of people are using inheritance today, even with all its
flaws, and they wouldn't be if there were a long laundry list of
limitations that didn't apply to regular tables.  We should be looking
to lift the limitations that currently exist, not add more.


I agree. If we are to support UPDATE .. ORDER BY .. LIMIT, it should 
work with inheritance. So the path forward is (using Marko's phrasing 
upthread):


   1) We put the LIMIT inside ModifyTable like this patch does.  This
doesn't prevent us from doing ORDER BY in the future, but helps numerous
people who today have to
   2) Someone rewrites how UPDATE works based on Tom's suggestion here:
http://www.postgresql.org/message-id/1598.1399826...@sss.pgh.pa.us,
which allows us to support ORDER BY on all tables (or perhaps maybe not
FDWs, I don't know how those work).  The LIMIT functionality in this
patch is unaffected.

What's not clear to me is whether it makes sense to do 1) without 2)? Is
UPDATE .. LIMIT without support for an ORDER BY useful enough? And if we
apply this patch now, how much of it needs to be rewritten after 2)? If
the answers are "yes" and "not much", then we should review this patch
now, and put 2) on the TODO list. Otherwise 2) should be done first.


Etsuro, Kaigai, please take a look at the patch and try to make a guess
on how much of this still needs to be rewritten if we do 2). If not 
much, then please continue to review it, with the aim of getting it into 
a committable state.


- Heikki





Re: [HACKERS] pg_background (and more parallelism infrastructure patches)

2014-09-09 Thread Amit Kapila
On Tue, Sep 9, 2014 at 2:32 AM, Robert Haas robertmh...@gmail.com wrote:
 On Mon, Sep 8, 2014 at 2:05 AM, Amit Kapila amit.kapil...@gmail.com
wrote:
  Another point about error handling is that to execute the sql in
  function pg_background_worker_main(), it starts the transaction
  which I think doesn't get aborted if an error occurs
 
  For this I was just referring to the error handling code of
  StartBackgroundWorker(); however, during shutdown of the process, it
  will call AbortOutOfAnyTransaction() which will take care of aborting
  the transaction.

 Yeah.

  and similarly, handling
  of the timeout seems to be missing in the error path.
 
  As we are anyway going to exit on error, I am not sure if this will be
  required; however, having it for a clean exit seems better.

 Can you be more specific?

I was thinking of handling errors similarly to what PostgresMain does;
however, if it is going to exit then it might not be worth it.

 I'm generally of the view that there's little point in spending time
 cleaning things up that will go away anyway when the process exits.

Yeah, in that case it might not make much sense.  The argument here
could be why it has to exit at all: why can't it wait till the launcher asks it
to exit?  You have mentioned in a previous mail that you want to stick to
the approach taken by the patch, however it is not clear why you think that
is better.  If I try to think about how the worker backend should behave
in case the master backend assigns some task to the worker, then I think it
would be sensible for the worker to not exit after completing its task (whether
it has finished the execution of the assigned work or otherwise), as the master
backend can assign further tasks to the same worker.  Another advantage
would be that the setup and teardown cost of the worker is saved.  Now if
we just think w.r.t. pg_background it might not be so important to let the
worker wait till the launcher asks it to quit; however, as you have mentioned
in your mail upthread that you are expecting this kind of set-up for actual
parallelism, it might not be completely unreasonable to expect the worker
to wait till it has been asked to quit.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Scaling shared buffer eviction

2014-09-09 Thread Amit Kapila
On Tue, Sep 9, 2014 at 3:11 AM, Thom Brown t...@linux.com wrote:
 On 5 September 2014 14:19, Amit Kapila amit.kapil...@gmail.com wrote:
 On Fri, Sep 5, 2014 at 5:17 PM, Amit Kapila amit.kapil...@gmail.com
wrote:
 
  Apart from the above, I think for this patch a catversion bump is required,
  as I have modified a system catalog.  However, I have not done the
  same in the patch, as otherwise it will be a bit difficult to take
  performance data.

 One regression test failed on Linux due to a spacing issue, which is
 fixed in the attached patch.


 Here's a set of test results against this patch:

Many thanks for taking the performance data.  This data clearly shows
that there is a performance improvement even at a lower configuration;
however, the real benefit of the patch can be seen on a higher-core
machine with larger RAM (enough to contain all the data).

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Escaping from blocked send() reprised.

2014-09-09 Thread Kyotaro HORIGUCHI
Hi, I added and edited some comments.

 Sorry, the patch contains a silly bug. Please find the
 attached one.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From eb91a7c91e1fd3b24bf5bff0eb885f1c3d274637 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi horiguchi.kyot...@lab.ntt.co.jp
Date: Fri, 5 Sep 2014 17:21:48 +0900
Subject: [PATCH] Simply cutting off the socket if signalled during sending to client.

---
 src/backend/libpq/be-secure.c |   21 ++--
 src/backend/libpq/pqcomm.c|   13 +++
 src/backend/tcop/postgres.c   |   71 ++---
 src/include/libpq/libpq.h |1 +
 4 files changed, 70 insertions(+), 36 deletions(-)

diff --git a/src/backend/libpq/be-secure.c b/src/backend/libpq/be-secure.c
index 41ec1ad..3006697 100644
--- a/src/backend/libpq/be-secure.c
+++ b/src/backend/libpq/be-secure.c
@@ -145,11 +145,11 @@ secure_raw_read(Port *port, void *ptr, size_t len)
 {
 	ssize_t		n;
 
-	prepare_for_client_read();
+	prepare_for_client_comm();
 
 	n = recv(port->sock, ptr, len, 0);
 
-	client_read_ended();
+	client_comm_ended();
 
 	return n;
 }
@@ -178,5 +178,20 @@ secure_write(Port *port, void *ptr, size_t len)
 ssize_t
 secure_raw_write(Port *port, const void *ptr, size_t len)
 {
-	return send(port->sock, ptr, len, 0);
+	ssize_t		n;
+
+	/*
+	 * If we are interrupted while a send is in progress, processing the
+	 * interrupt immediately throws away the chance to finish sending the
+	 * bytes we were handed.  But the chance to send one more tuple, or
+	 * perhaps the final few bytes, matters far less than the risk that we
+	 * can never bail out of a send that blocks forever.
+	 */
+	prepare_for_client_comm();
+
+	n = send(port->sock, ptr, len, 0);
+
+	client_comm_ended();
+
+	return n;
 }
diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index 605d891..9b08529 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -1343,6 +1343,19 @@ pq_is_send_pending(void)
 }
 
 /* 
+ *		pq_is_busy	- is there any I/O command running?
+ *
+ *		This function is intended for use within signal handlers to check
+ *		whether any pqcomm I/O operation is in progress.
+ * 
+ */
+bool
+pq_is_busy(void)
+{
+	return PqCommBusy;
+}
+
+/* 
  * Message-level I/O routines begin here.
  *
  * These routines understand about the old-style COPY OUT protocol.
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 7b5480f..b29b200 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -303,16 +303,16 @@ InteractiveBackend(StringInfo inBuf)
  *
  * Even though we are not reading from a client process, we still want to
  * respond to signals, particularly SIGTERM/SIGQUIT.  Hence we must use
- * prepare_for_client_read and client_read_ended.
+ * prepare_for_client_comm and client_comm_ended.
  */
 static int
 interactive_getc(void)
 {
 	int			c;
 
-	prepare_for_client_read();
+	prepare_for_client_comm();
 	c = getc(stdin);
-	client_read_ended();
+	client_comm_ended();
 	return c;
 }
 
@@ -487,53 +487,47 @@ ReadCommand(StringInfo inBuf)
 }
 
 /*
- * prepare_for_client_read -- set up to possibly block on client input
+ * prepare_for_client_comm -- set up to possibly block on client communication
  *
- * This must be called immediately before any low-level read from the
- * client connection.  It is necessary to do it at a sufficiently low level
- * that there won't be any other operations except the read kernel call
- * itself between this call and the subsequent client_read_ended() call.
+ * This must be called immediately before any low-level read from or write to
+ * the client connection.  It is necessary to do it at a sufficiently low
+ * level that there won't be any other operations except the read/write kernel
+ * call itself between this call and the subsequent client_comm_ended() call.
  * In particular there mustn't be use of malloc() or other potentially
- * non-reentrant libc functions.  This restriction makes it safe for us
- * to allow interrupt service routines to execute nontrivial code while
- * we are waiting for input.
+ * non-reentrant libc functions.  This restriction makes it safe for us to
+ * allow interrupt service routines to execute nontrivial code while we are
+ * waiting for input or blocked on output.
  */
 void
-prepare_for_client_read(void)
+prepare_for_client_comm(void)
 {
-	if (DoingCommandRead)
-	{
-		/* Enable immediate processing of asynchronous signals */
-		EnableNotifyInterrupt();
-		EnableCatchupInterrupt();
+	/* Enable immediate processing of asynchronous signals */
+	EnableNotifyInterrupt();
+	EnableCatchupInterrupt();
 
-		/* Allow cancel/die interrupts to be processed while waiting */
-		ImmediateInterruptOK = true;
+	/* Allow cancel/die interrupts to be processed while waiting */
+	ImmediateInterruptOK = true;
 
-		/* 

Re: [HACKERS] LIMIT for UPDATE and DELETE

2014-09-09 Thread Marko Tiikkaja

On 9/9/14 11:57 AM, Heikki Linnakangas wrote:

What's not clear to me is whether it makes sense to do 1) without 2)? Is
UPDATE .. LIMIT without support for an ORDER BY useful enough?


I'd say so; I could use it right now.  I have to remove millions of 
lines from a table, but I don't want the live transaction processing to 
take a hit, so I have to do it in batches.  Granted, some kind of rate 
limiting would achieve the same thing for DELETE, but with UPDATE you 
also have to consider row locking, and rate limiting wouldn't help with 
that at all; it would, in fact, just make it worse.  I'll also be 
running a big UPDATE like that later today, so that's two uses today 
alone for me.  And no, these are not routine things, so keep your "use
partitions" suggestions to yourselves.



And if we
apply this patch now, how much of it needs to be rewritten after 2)? If
the answers are "yes" and "not much", then we should review this patch
now, and put 2) on the TODO list. Otherwise 2) should be done first.


I'd say not much, if anything at all.


.marko




Re: [HACKERS] LIMIT for UPDATE and DELETE

2014-09-09 Thread Marko Tiikkaja

On 9/9/14 12:37 PM, I wrote:

And no, these are not routine things, so keep your "use
partitions" suggestions to yourselves.


My apologies.  This was not directed at you personally, Heikki, and in 
any case it was unnecessarily hostile.



.marko




Re: [HACKERS] pgbench throttling latency limit

2014-09-09 Thread Heikki Linnakangas

On 09/05/2014 06:38 PM, Jan Wieck wrote:

On 09/05/2014 10:12 AM, Fabien COELHO wrote:

Note that despite pg's appalling latency performance, it may stay well over
the 90% limit, or even at 100%: when things are going well a lot of
transactions run in about a millisecond, while when things are going bad
transactions would take a long time (although possibly under or about 1s
anyway), *but* very few transactions get through, so the throughput is very
small. The fact that during 15 seconds only 30 transactions are processed
is a detail that does not show up in the metric.


Yeah, it makes much more sense to measure the latency from the 
scheduled time than the actual time.



I haven't used the real pgbench for a long time. I will have to look at
your patch and see what the current version actually does or does not do.

What I have been using is a Python version of pgbench that I wrote for
myself when I started learning that language. That one does record both
values, the DB transaction latency and the client response time (the time
from the request being entered into the queue until transaction commit).
When I look at those results it is possible to have an utterly failing
run, with 60% of client response times being within 2 seconds, but all
the DB transactions still in milliseconds.
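
(To illustrate the difference with made-up numbers: a transaction
scheduled at t = 0 ms that actually starts at t = 900 ms and commits at
t = 910 ms has a DB transaction latency of 10 ms but a client response
time of 910 ms; the 900 ms of schedule lag is invisible if only the
former is reported.)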


I think we have to reconsider what we're reporting in 9.4, when --rate 
is enabled, even though it's already very late in the release cycle. 
It's a bad idea to change the definition of latency between 9.4 and 9.5, 
so let's get it right in 9.4.



As said, I'll have to take a look at it. Since I am on vacation next
week, getting ready for my first day at EnterpriseDB, this may actually
happen.


Oh, congrats! :-)

- Heikki





Re: [HACKERS] LIMIT for UPDATE and DELETE

2014-09-09 Thread Heikki Linnakangas

On 09/09/2014 01:46 PM, Marko Tiikkaja wrote:

On 9/9/14 12:37 PM, I wrote:

And no, these are not routine things so keep your use
partitions suggestions to yourselves.


My apologies.  This was not directed at you personally, Heikki, and in
any case it was unnecessarily hostile.


No worries, it made me smile :-)

- Heikki





Re: [HACKERS] pgbench throttling latency limit

2014-09-09 Thread Heikki Linnakangas

On 09/09/2014 01:49 PM, Heikki Linnakangas wrote:

I think we have to reconsider what we're reporting in 9.4, when --rate
is enabled, even though it's already very late in the release cycle.
It's a bad idea to change the definition of latency between 9.4 and 9.5,
so let's get it right in 9.4.


As per the attached patch. I think we should commit this to 9.4. Any 
objections?


The text this patch adds to the documentation needs some rewording, 
though. As does this existing paragraph:



High rate limit schedule lag values, that is lag values that are
large compared to the actual transaction latency, indicate that
something is amiss in the throttling process. High schedule lag can
highlight a subtle problem there even if the target rate limit is met
in the end. One possible cause of schedule lag is insufficient
pgbench threads to handle all of the clients. To improve that,
consider reducing the number of clients, increasing the number of
threads in pgbench, or running pgbench on a separate host. Another
possibility is that the database is not keeping up with the load at
some point. When that happens, you will have to reduce the expected
transaction rate to lower schedule lag.


It took me ages to parse "high rate limit schedule lag values".

- Heikki

diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 2f7d80e..bf9c4ea 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -1087,8 +1087,22 @@ top:
 			int64		latency;
 
 			INSTR_TIME_SET_CURRENT(diff);
-			INSTR_TIME_SUBTRACT(diff, st->txn_begin);
-			latency = INSTR_TIME_GET_MICROSEC(diff);
+
+			if (throttle_delay)
+			{
+				/*
+				 * Under throttling, compute latency from the scheduled start
+				 * time, not the actual transaction start.
+				 */
+				int64		now = INSTR_TIME_GET_MICROSEC(diff);
+				latency = now - thread->throttle_trigger;
+			}
+			else
+			{
+				INSTR_TIME_SUBTRACT(diff, st->txn_begin);
+				latency = INSTR_TIME_GET_MICROSEC(diff);
+			}
+
 			st->txn_latencies += latency;
 
 			/*
diff --git a/doc/src/sgml/pgbench.sgml b/doc/src/sgml/pgbench.sgml
index b479105..75fd8ec 100644
--- a/doc/src/sgml/pgbench.sgml
+++ b/doc/src/sgml/pgbench.sgml
@@ -403,8 +403,10 @@ pgbench <optional> <replaceable>options</> </optional> <replaceable>dbname</>
 Show progress report every <literal>sec</> seconds.  The report
 includes the time since the beginning of the run, the tps since the
 last report, and the transaction latency average and standard
-deviation since the last report.  Under throttling (<option>-R</>), it
-also includes the average schedule lag time since the last report.
+deviation since the last report.  Under throttling (<option>-R</>),
+the latency is computed with respect to the transaction scheduled
+start time, not the actual transaction beginning time, thus it also
+includes the average schedule lag time.
    </para>
   </listitem>
  </varlistentry>
@@ -452,6 +454,17 @@ pgbench <optional> <replaceable>options</> </optional> <replaceable>dbname</>
 output.
   </para>
   <para>
+Also, when throttling is active, the transaction latency reported
+at the end of the run is computed from the scheduled start time, not
+the actual transaction start time, so it includes the lag time.
+The perspective is from the point of view of a client who wanted to
+perform a transaction but could not even start it for some time.  Very
+high latencies may be reported when the transactions are lagging behind
+schedule.  The transaction latency with respect to the actual
+transaction start time can be computed by subtracting the transaction
+lag time from the reported latency.
+   </para>
+   <para>
 High rate limit schedule lag values, that is lag values that are large
 compared to the actual transaction latency, indicate that something is
 amiss in the throttling process.  High schedule lag can highlight a subtle



Re: BRIN indexes (was Re: [HACKERS] Minmax indexes)

2014-09-09 Thread Heikki Linnakangas

On 09/08/2014 07:02 PM, Alvaro Herrera wrote:

Here's version 18.  I have renamed it: These are now BRIN indexes.

I have fixed numerous race conditions and deadlocks.  In particular I
fixed this problem you noted:

Heikki Linnakangas wrote:

Another race condition:

If a new tuple is inserted to the range while summarization runs,
it's possible that the new tuple isn't included in the tuple that
the summarization calculated, nor does the insertion itself update
it.


I did it mostly in the way you outlined, i.e. by way of a placeholder
tuple that gets updated by concurrent inserters and then the tuple
resulting from the scan is unioned with the values in the updated
placeholder tuple.  This required the introduction of one extra support
proc for opclasses (pretty simple stuff anyhow).


Hmm. So the union support proc is only called if there is a race 
condition? That makes it very difficult to test, I'm afraid.


It would make more sense to pass BrinValues to the support functions, 
rather than DeformedBrTuple. An opclass's support function should never
need to access the values for other columns.


Does minmaxUnion handle NULLs correctly?

minmaxUnion pfrees the old values. Is that necessary? What memory 
context does the function run in? If the code runs in a short-lived 
memory context, you might as well just let them leak. If it runs in a 
long-lived context, well, perhaps it shouldn't. It's nicer to write 
functions that can leak freely. IIRC, GiST and GIN run the support
functions in a temporary context. In any case, it might be worth noting 
explicitly in the docs which functions may leak and which may not.
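
The pattern in question is roughly this (a sketch; "tmpcxt" would be
created once per scan, and the function call below is hypothetical):

	MemoryContext oldcxt = MemoryContextSwitchTo(tmpcxt);

	/* the support function may allocate freely in tmpcxt */
	FunctionCall2Coll(&unionFn, collation, a, b);

	MemoryContextSwitchTo(oldcxt);
	MemoryContextReset(tmpcxt);		/* everything leaked above goes away */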


If you add a new datatype, and define b-tree operators for it, what is 
required to create a minmax opclass for it? Would it be possible to 
generalize the functions in brin_minmax.c so that they can be reused for 
any datatype (with b-tree operators) without writing any new C code? I 
think we're almost there; the only thing that differs between each data 
type is the opcinfo function. Let's pass the type OID as argument to the 
opcinfo function. You could then have just a single minmax_opcinfo 
function, instead of the macro to generate a separate function for each 
built-in datatype.
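
Something like this hypothetical shape (the BrinOpcInfo field names and
the sizing macro here are guesses for illustration, not the patch's
actual definitions):

Datum
minmax_opcinfo(PG_FUNCTION_ARGS)
{
	Oid			typoid = PG_GETARG_OID(0);
	BrinOpcInfo *result;

	/* two stored values per range: the min and the max, both of typoid */
	result = palloc0(SizeofBrinOpcInfo(2));	/* hypothetical sizing macro */
	result->oi_nstored = 2;
	result->oi_typids[0] = typoid;
	result->oi_typids[1] = typoid;

	PG_RETURN_POINTER(result);
}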


In general, this patch is in pretty good shape now, thanks!

- Heikki





Re: [HACKERS] [TODO] Process pg_hba.conf keywords as case-insensitive

2014-09-09 Thread Kyotaro HORIGUCHI
Hello, I will be the reviewer of this patch.

Your approach of coloring the tokens seems right, but you have
broken the parse logic with your added code.

Other than the mistakes others pointed out, I found that:

- For non-SQL-ident-like tokens, the token style (quoted or not) is
  ignored, so the following line works.

| local All aLL trust

I suppose this is not what you intended. This is because you have
ignored the attribute of a token when comparing it as a
non-SQL-ident token.


- A '+' at the head of a quoted sequence is treated as the first
  character of the *quoted* string, e.g. "+hoge" is tokenized as
  +hoge:special_quoted.

This is because you simply continued processing on '+' without
discarding and skipping the '+', and without setting in_quote, so the
following parser code does not work as intended. You should
understand what the original code does, and insert or modify
logic without breaking its assumptions.

  With this patch, database (and role?) names are compared case-insensitively.
  For example:
  
  local  MixedDB all trust
  local  mixedDB all reject
  
  psql -d mixedDB
  psql (9.5devel)
  Type "help" for help.
  
  mixedDB=#
  
  That connection should've matched that 2nd line, and be rejected.
 
 Actually it should have matched neither, as both lines will get folded
 to lower case:
 
 local  mixeddb all trust
 local  mixeddb all reject

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [HACKERS] posix_fadvise() and pg_receivexlog

2014-09-09 Thread didier
Hi

 Well, I'd like to hear someone from the field complaining that
 pg_receivexlog is thrashing the cache and thus reducing the performance of
 some other process. Or a least a synthetic test case that demonstrates that
 happening.
It's not with pg_receivexlog but it's related.

On a small box without a replication server connected, perfs were good
enough, but not so with a replication server connected: there was 1GB
worth of WAL sitting in RAM vs. next to nothing without the slave!
setup:
8GB RAM
2GB shared_buffers (smaller has other issues)
checkpoint_segments 40 (a smaller value triggers too many xlog checkpoints)
checkpoints spread over 10 min, writing 30 to 50% of shared buffers.
live data set fits in RAM.
constant load.

On startup (1 or 2/hour) applications were running requests on cold
data, which were now saturating IO.
I'm not sure it's an OS bug, as the WAL was 'hotter' than the cold data.

A cron task every minute with vmtouch -e to evict old WAL files
from memory has solved the issue.

Regards




Re: [HACKERS] [TODO] Process pg_hba.conf keywords as case-insensitive

2014-09-09 Thread Kyotaro HORIGUCHI
Sorry for the wrong suggestion: setting in_quote there is wrong because it's
before the opening quote. However, advancing the read pointer and replacing
c with the next value is still needed.

regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
2014/09/09 20:49 Kyotaro HORIGUCHI horiguchi.kyot...@lab.ntt.co.jp:

 Hello, I will be the reviewer of this patch.

 Your approach of coloring the tokens seems right, but you have
 broken the parse logic with your added code.

 Other than the mistakes others pointed out, I found that:

 - For non-SQL-ident-like tokens, the token style (quoted or not) is
   ignored, so the following line works.

 | local All aLL trust

 I suppose this is not what you intended. This is because you have
 ignored the attribute of a token when comparing it as a
 non-SQL-ident token.


 - A '+' at the head of a quoted sequence is treated as the first
   character of the *quoted* string, e.g. "+hoge" is tokenized as
   +hoge:special_quoted.

 This is because you simply continued processing on '+' without
 discarding and skipping the '+', and without setting in_quote, so the
 following parser code does not work as intended. You should
 understand what the original code does, and insert or modify
 logic without breaking its assumptions.

   With this patch, database (and role?) names are compared
 case-insensitively.
   For example:
  
   local  MixedDB all trust
   local  mixedDB all reject
  
   psql -d mixedDB
   psql (9.5devel)
   Type "help" for help.
  
   mixedDB=#
  
   That connection should've matched that 2nd line, and be rejected.
 
  Actually it should have matched neither, as both lines will get folded
  to lower case:
 
  local  mixeddb all trust
  local  mixeddb all reject

 regards,

 --
 Kyotaro Horiguchi
 NTT Open Source Software Center





Re: [HACKERS] pgbench throttling latency limit

2014-09-09 Thread Fabien COELHO


Hello Heikki,


I think we have to reconsider what we're reporting in 9.4, when --rate
is enabled, even though it's already very late in the release cycle.
It's a bad idea to change the definition of latency between 9.4 and 9.5,
so let's get it right in 9.4.


Indeed.

As per the attached patch. I think we should commit this to 9.4. Any 
objections?


Ok for me. Some more propositions about the doc below.


The text this patch adds to the documentation needs some rewording, though.


Probably. Not sure how to improve.


As does this existing paragraph:


High rate limit schedule lag values, that is lag values that are
large compared to the actual transaction latency, indicate that
something is amiss in the throttling process. High schedule lag can
highlight a subtle problem there even if the target rate limit is met
in the end.



One possible cause of schedule lag is insufficient
pgbench threads to handle all of the clients. To improve that,
consider reducing the number of clients, increasing the number of
threads in pgbench, or running pgbench on a separate host. Another
possibility is that the database is not keeping up with the load at
some point. When that happens, you will have to reduce the expected
transaction rate to lower schedule lag.


It took me ages to parse "high rate limit schedule lag values".


Indeed, I'm not proud of that one... Moreover the first sentence becomes 
false with the new latency computation, as the lag time is included.


I would suggest:

When under throttling, the reported lag time measures the delay between 
the scheduled start time for the transaction and its actual start time. A 
high value, where the lag time represents most of the transaction latency,
may indicate that something is amiss in the throttling process itself, 
even if the target rate is met in the end. One possible cause ...


--
Fabien.




Re: PENDING_LIST_CLEANUP_SIZE - maximum size of GIN pending list Re: [HACKERS] HEAD seems to generate larger WAL regarding GIN index

2014-09-09 Thread Fujii Masao
On Tue, Sep 9, 2014 at 1:28 AM, Jeff Janes jeff.ja...@gmail.com wrote:
 On Sun, Aug 17, 2014 at 7:46 PM, Fujii Masao masao.fu...@gmail.com wrote:


 Thanks for reviewing the patch! ISTM that I failed to make the patch from
 my git repository... Attached is the rebased version.



 I get some compiler warnings on v2 of this patch:

 reloptions.c:219: warning: excess elements in struct initializer
 reloptions.c:219: warning: (near initialization for 'intRelOpts[15]')

Thanks for testing the patch!
Attached is the updated version of the patch.

Previously the patch depended on another infrastructure patch
(which allows a user to specify the unit in a reloption (*1)). But that
infrastructure patch has a serious problem which is not easy to fix,
so I changed this patch so that it doesn't depend on
that infrastructure patch at all. Even without the infrastructure
patch, the feature that this patch introduces is useful.

Also I added the regression test into the patch.

(*1)
http://www.postgresql.org/message-id/CAHGQGwEanQ_e8WLHL25=bm_8z5zkyzw0k0yir+kdmv2hgne...@mail.gmail.com

Regards,

-- 
Fujii Masao
*** a/doc/src/sgml/gin.sgml
--- b/doc/src/sgml/gin.sgml
***************
*** 728,735 ****
     from the indexed item). As of <productname>PostgreSQL</productname> 8.4,
     <acronym>GIN</> is capable of postponing much of this work by inserting
     new tuples into a temporary, unsorted list of pending entries.
!    When the table is vacuumed, or if the pending list becomes too large
!    (larger than <xref linkend="guc-work-mem">), the entries are moved to the
     main <acronym>GIN</acronym> data structure using the same bulk insert
     techniques used during initial index creation.  This greatly improves
     <acronym>GIN</acronym> index update speed, even counting the additional
--- 728,736 ----
     from the indexed item). As of <productname>PostgreSQL</productname> 8.4,
     <acronym>GIN</> is capable of postponing much of this work by inserting
     new tuples into a temporary, unsorted list of pending entries.
!    When the table is vacuumed, or if the pending list becomes larger than
!    <literal>PENDING_LIST_CLEANUP_SIZE</literal> (or
!    <xref linkend="guc-work-mem"> if not set), the entries are moved to the
     main <acronym>GIN</acronym> data structure using the same bulk insert
     techniques used during initial index creation.  This greatly improves
     <acronym>GIN</acronym> index update speed, even counting the additional
***************
*** 812,829 ****
    </varlistentry>
 
    <varlistentry>
!    <term><xref linkend="guc-work-mem"></term>
     <listitem>
      <para>
       During a series of insertions into an existing <acronym>GIN</acronym>
       index that has <literal>FASTUPDATE</> enabled, the system will clean up
       the pending-entry list whenever the list grows larger than
!      <varname>work_mem</>.  To avoid fluctuations in observed response time,
!      it's desirable to have pending-list cleanup occur in the background
!      (i.e., via autovacuum).  Foreground cleanup operations can be avoided by
!      increasing <varname>work_mem</> or making autovacuum more aggressive.
!      However, enlarging <varname>work_mem</> means that if a foreground
!      cleanup does occur, it will take even longer.
      </para>
     </listitem>
    </varlistentry>
--- 813,839 ----
    </varlistentry>
 
    <varlistentry>
!    <term><literal>PENDING_LIST_CLEANUP_SIZE</> and
!    <xref linkend="guc-work-mem"></term>
     <listitem>
      <para>
       During a series of insertions into an existing <acronym>GIN</acronym>
       index that has <literal>FASTUPDATE</> enabled, the system will clean up
       the pending-entry list whenever the list grows larger than
!      <literal>PENDING_LIST_CLEANUP_SIZE</> (if not set, <varname>work_mem</>
!      is used as that threshold, instead). To avoid fluctuations in observed
!      response time, it's desirable to have pending-list cleanup occur in the
!      background (i.e., via autovacuum).  Foreground cleanup operations
!      can be avoided by increasing <literal>PENDING_LIST_CLEANUP_SIZE</>
!      (or <varname>work_mem</>) or making autovacuum more aggressive.
!      However, enlarging the threshold of the cleanup operation means that
!      if a foreground cleanup does occur, it will take even longer.
!     </para>
!     <para>
!      <literal>PENDING_LIST_CLEANUP_SIZE</> is an index storage parameter,
!      and allows each GIN index to have its own cleanup threshold.
!      For example, it's possible to increase the threshold only for the GIN
!      index which can be updated heavily, and decrease it otherwise.
      </para>
     </listitem>
    </varlistentry>
*** a/doc/src/sgml/ref/create_index.sgml
--- b/doc/src/sgml/ref/create_index.sgml
***************
*** 356,361 **** CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ <replaceable class="parameter">name</>
--- 356,377 ----
      </listitem>
     </varlistentry>
    </variablelist>
+   <variablelist>
+   <varlistentry>
+    <term><literal>PENDING_LIST_CLEANUP_SIZE</></term>
+    <listitem>
+    <para>
+     This setting specifies the maximum size 

Re: [HACKERS] posix_fadvise() and pg_receivexlog

2014-09-09 Thread Fujii Masao
On Tue, Sep 9, 2014 at 9:07 PM, didier did...@gmail.com wrote:
 Hi

 Well, I'd like to hear someone from the field complaining that
 pg_receivexlog is thrashing the cache and thus reducing the performance of
 some other process. Or at least a synthetic test case that demonstrates that
 happening.
 It's not with pg_receivexlog but it's related.

 On a small box without replication server connected perfs were good
 enough but not so with a replication server connected, there was 1GB
 worth of WAL sitting in RAM vs next to nothing without slave!

After a WAL file is filled up and closed, it will not be re-read
if wal_level is set to minimal (i.e., neither archiving nor
replication is enabled). So, in this case, PostgreSQL advises the OS
to release any cached pages of that WAL file. But not if archiving
or replication is enabled; then the WAL file keeps being cached
even after it's closed. This is probably the cause of what you
observed.
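
The hint in question is the posix_fadvise() call; a minimal sketch of
the idea (error handling omitted):

#include <fcntl.h>

static void
drop_from_cache(int fd)
{
	/* advise over the whole file: offset 0, length 0 means "to EOF" */
	(void) posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
}

Something like this could, in principle, be issued by pg_receivexlog
after closing each finished segment.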

Regards,

-- 
Fujii Masao




Re: [HACKERS] WIP Patch for GROUPING SETS phase 1

2014-09-09 Thread Robert Haas
On Thu, Aug 21, 2014 at 11:01 AM, Andrew Gierth
and...@tao11.riddles.org.uk wrote:
 Heikki == Heikki Linnakangas hlinnakan...@vmware.com writes:
  Heikki Uh, that's ugly. The EXPLAIN output I mean; as an implementation
  Heikki detail chaining the nodes might be reasonable. But the above
  Heikki gets unreadable if you have more than a few grouping sets.

 It's good for highlighting performance issues in EXPLAIN, too.

Perhaps so, but that doesn't take away from Heikki's point: it's still
ugly.  I don't understand why the sorts can't all be nested under the
GroupAggregate nodes.  We have a number of nodes already (e.g. Append)
that support an arbitrary number of children, and I don't see why we
can't do the same thing here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




[HACKERS] Memory Alignment in Postgres

2014-09-09 Thread Arthur Silva
I'm continuously studying the Postgres codebase. Hopefully I'll be able to
make some contributions in the future.

For now I'm intrigued by the extensive use of memory alignment. I'm sure
there's some legacy reasoning behind it, and some architectures that
require it.

That aside, since it wastes space (a lot of space in some cases) there must
be a tipping point somewhere. I'm sure one can prove aligned access is
faster in a micro-benchmark, but I'm not sure that's the case in a DBMS like
Postgres, especially in the page/rows area.

Just for the sake of comparison, MySQL's COMPACT storage (default and
recommended since 5.5) doesn't align data at all. MySQL NDB uses a fixed
4-byte alignment. Not sure about Oracle and others.

Is it worth the extra space on newer architectures (especially Intel)?
Do you guys think this is something worth looking at?

I've been messing with the *ALIGN macros, but so far I wasn't able to
get any conclusive results. My guess is that I'm missing something in the
code, or pgbench doesn't stress the difference enough.

--
Arthur Silva


Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-09-09 Thread Robert Haas
On Mon, Sep 8, 2014 at 5:53 PM, Tomas Vondra t...@fuzzy.cz wrote:
 So I only posted the separate patch for those who want to do a review,
 and then a complete patch with both parts combined. But it sure may be
 a bit confusing.

Let's do this: post each new version of the patches only on this
thread, as a series of patches that can be applied one after another.

 In ExecChooseHashTableSize(), if buckets_bytes is independent of
 nbatch, could we just compute it before working out nbatch, and just
 deduct it from hash_table_bytes? That seems like it'd do the same
 thing for less work.

 Well, we can do that. But I really don't think that's going to make
  a measurable difference. And I think the code is more clear this way.

Really?  It seems to me that you're doing more or less the same
calculation that's already been done a second time.  It took me 20
minutes of staring to figure out what it was really doing.  Now
admittedly I was a little low on caffeine, but...

 Either way, what happens if buckets_bytes is already bigger than
 hash_table_bytes?

 Hm, I don't see how that could happen.

How about an Assert() and a comment, then?
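
Something like this sketch, following the discussion's variable names
(not the committed code):

	/* bucket array is an array of HashJoinTuple pointers */
	buckets_bytes = sizeof(HashJoinTuple) * nbuckets;

	/*
	 * The optimal bucket count is derived from the memory allowance, so
	 * it should never exceed it; guard against arithmetic slips anyway.
	 */
	Assert(buckets_bytes <= hash_table_bytes);

	hash_table_bytes -= buckets_bytes;	/* remainder goes to tuples */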

 A few more stylistic issues that I see:

 +   if ((hashtable->nbatch == 1) &&
 +   if (hashtable->nbuckets_optimal <= (INT_MAX/2))
 +   if (size < (HASH_CHUNK_SIZE/8))

 While I'm all in favor of using parentheses generously where it's
 needed to add clarity, these and similar usages seem over the top to
 me.

 It seemed more readable to me at the time of coding it, and it still
 seems better this way, but ...

 https://www.youtube.com/watch?v=CxK_nA2iVXw

Heh.  Well, at any rate, I think PostgreSQL style is to not use
parentheses in that situation.

 +char * chunk_alloc(HashJoinTable hashtable, int size) {

 Eh... no.

 +   /* XXX maybe we should use MAXALIGN(size) here ... */

 +1.

 I'm not sure what the 'no' is pointing at. Maybe that the parenthesis
 should be on the next line? And is the +1 about doing MAXALIGN?

The "no" is about the fact that you have the return type, the function
name, and the opening brace on one line instead of three lines as is
project style.  The +1 is for applying MAXALIGN to the size.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Optimization for updating foreign tables in Postgres FDW

2014-09-09 Thread Albe Laurenz
Etsuro Fujita wrote:
 I agree with you on that point.  So, I've updated the patch to have the
 explicit flag, as you proposed.  Attached is the updated version of the
 patch.  In this version, I've also revised code and its comments a bit.

Thank you, I have set the patch to "Ready for Committer".

Yours,
Laurenz Albe



Re: [HACKERS] Join push-down support for foreign tables

2014-09-09 Thread Robert Haas
On Sun, Sep 7, 2014 at 7:07 PM, Shigeru HANADA shigeru.han...@gmail.com wrote:
 I think it's probably good to give an FDW the option of producing a
 ForeignJoinPath for any join against a ForeignPath *or
 ForeignJoinPath* for the same FDW.  It's perhaps unlikely that an FDW
 can perform a join efficiently between two data sources with different
 server definitions, but why not give it the option?  It should be
 pretty fast for the FDW to realize, oh, the server OIDs don't match -
 and at that point it can exit without doing anything further if that
 seems desirable.  And there might be some kinds of data sources where
 cross-server joins actually can be executed quickly (e.g. when the
 underlying data is just in two files in different places on the local
 machine).

 Indeed, how to separate servers is left to users, or to the author of the
 FDW, though postgres_fdw and most other FDWs can join foreign tables within
 a server.  I think it would be good if we could know that two foreign tables
 are managed by the same FDW from FdwRoutine; maybe add a new API which
 returns an FDW identifier?

Do we need this?  I mean, if you get the FdwRoutine, don't you have
the OID of the FDW or the foreign table also?

 I think so too, so a ForeignJoinPath should be able to be an input of
 another ForeignJoinPath at an upper join level.  But I also think whether
 to join on the remote side or not should be decided based on cost, as
 existing joins are planned with a bottom-up approach.
 bottom-up approach.

Definitely.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] add modulo (%) operator to pgbench

2014-09-09 Thread Robert Haas
On Wed, Aug 6, 2014 at 3:22 PM, Fabien COELHO coe...@cri.ensmp.fr wrote:
 Maybe we ought to break down and support a real expression syntax.
 Sounds like that would be better all around.

 Adding operators is more or less orthogonal to providing a new expression
 syntax. I'm not sure that there is currently a crying need for it (a
 syntax). It would be a significant project. It would raise the question
 where to stop... And I just really need a few more functions/operators
 which can be fitted in the current implementation quite easily.

Writing a simple expression parser for pgbench using flex and bison
would not be an inordinately difficult problem.  And it would let you
add abs() and ?: and thereby avoid the need to hard-code multiple
versions of the modulo operator in the patch.  The fact that you're
agonizing about which modulo to add is a sign that the language is too
impoverished to let you do anything non-trivial.  That's why C only
has one modulo operator: as the patch demonstrates, the differences
between the versions can be patched over with a very short C
expression.  If we had real expressions in pgbench, you could do the
same thing there.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Memory Alignment in Postgres

2014-09-09 Thread Bruce Momjian
On Tue, Sep  9, 2014 at 11:08:05AM -0300, Arthur Silva wrote:
 I'm continuously studying the Postgres codebase. Hopefully I'll be able to make
 some contributions in the future.
 
 For now I'm intrigued by the extensive use of memory alignment. I'm sure
 there's some legacy reasoning behind it, and some architectures that
 require it.
 
 That aside, since it wastes space (a lot of space in some cases) there must be
 a tipping point somewhere. I'm sure one can prove aligned access is faster in a
 micro-benchmark, but I'm not sure that's the case in a DBMS like Postgres,
 especially in the page/rows area.
 
 Just for the sake of comparison, MySQL's COMPACT storage (default and recommended
 since 5.5) doesn't align data at all. MySQL NDB uses a fixed 4-byte alignment.
 Not sure about Oracle and others.
 
 Is it worth the extra space on newer architectures (especially Intel)?
 Do you guys think this is something worth looking at?
 
 I've been messing with the *ALIGN macros, but so far I wasn't able to get
 any conclusive results. My guess is that I'm missing something in the code, or
 pgbench doesn't stress the difference enough.

Postgres reads data blocks from disk and puts them in shared memory; then
the CPU accesses those values, like floats and integers, as though they
were in allocated memory, i.e. we make no adjustments to the data from
disk all the way to the CPU.

I don't think anyone has measured the overhead of doing less alignment,
but I would be interested to see any test results produced.
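
For reference, the alignment being discussed comes from macros like
these (lightly reformatted from src/include/c.h):

#define TYPEALIGN(ALIGNVAL,LEN)  \
	(((intptr_t) (LEN) + ((ALIGNVAL) - 1)) & ~((intptr_t) ((ALIGNVAL) - 1)))

#define MAXALIGN(LEN)	TYPEALIGN(MAXIMUM_ALIGNOF, (LEN))

So with MAXIMUM_ALIGNOF = 8, MAXALIGN(5) = 8 and MAXALIGN(9) = 16: a
5-byte value occupies 8 bytes on disk and in shared buffers, which is
the space cost being weighed here.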

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_upgrade and epoch

2014-09-09 Thread Bruce Momjian
On Sat, Sep  6, 2014 at 09:30:06AM -0400, Bruce Momjian wrote:
 On Fri, Sep  5, 2014 at 07:35:42PM -0400, Bruce Momjian wrote:
  On Sat, Sep  6, 2014 at 12:26:55AM +0100, Greg Stark wrote:
   On Wed, Sep 3, 2014 at 3:59 AM, Bruce Momjian br...@momjian.us wrote:
I have developed the attached patch which causes pg_upgrade to preserve
the transaction epoch.  I plan to apply this for PG 9.5.
   
   I would say this is a simple bug and should be back patched to 9.4 and
   9.3. We're only going to continue to get complaints from people
   running into this.
  
  Yes, I did think about that, but it seems like a behavior change. 
  However, it is tempting to avoid future bug reports about this.
 
 When this came up in March, Tom and I agreed that this wasn't something
 we wanted to slip into 9.4.  Given that, it is hard to argue we should
 now slip this into 9.5, 9.4, and 9.3, so unless someone else votes for
 inclusion, I think I will leave this as 9.5-only.

With no one replying, I will consider this issue closed and not
backpatch this.

-- 
  Bruce Momjian  br...@momjian.us    http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ALTER TABLESPACE MOVE command tag tweak

2014-09-09 Thread Stephen Frost
Alvaro,

* Alvaro Herrera (alvhe...@2ndquadrant.com) wrote:
 Thanks.  One more tweak --- the whole reason for fiddling with this is
 to ensure that event triggers support this operation.  Therefore this
 node should be handled by ProcessUtilitySlow, not
 standard_ProcessUtility, as in the attached patch.

[...]

 I propose this for 9.4 too.

Done.

 I bet they have!  Have fun,

Thanks!  I'm trying to. :)

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] add modulo (%) operator to pgbench

2014-09-09 Thread Fabien COELHO


Hello Robert,

Writing a simple expression parser for pgbench using flex and bison 
would not be an inordinately difficult problem.


Sure. Note that there is not only the parsing but also the evaluating to 
think of, which means a data structure to represent the expressions that 
would be more complex than the current one. Although it is not difficult 
as such, it would mean breaking pgbench heavily, which would take 
more time and energy than I have available.


And it would let you add abs() and ?: and thereby avoid the need to 
hard-code multiple versions of the modulo operator in the patch.


It would also mean *implementing* abs and ?: in the execution code that 
computes the parsed expression.


The fact that you're agonizing about which modulo to add is a sign that 
the language is too impoverished to let you do anything non-trivial.


I'm not agonizing about which modulo to use :-) I know I do not want the 
C/SQL version, which is never really useful if you have signed data. I 
overlooked this detail when submitting the first patch, and produced a 
stupid patch with all 3 possible modulos, before settling on the sane 
solution, which is to use the floored-division (Knuth) version. If I had 
thought a bit, I would have sent v3 as v1 directly.


That's why C only has one modulo operator: as the patch demonstrates, 
the differences between the versions can be patched over with a very 
short C expression. If we had real expressions in pgbench, you could do 
the same thing there.


Sure. I agree that pgbench is limited and that real expressions would have 
helped, but it is also quite useful and easy to extend as is. Maybe "add 
an expression parser to pgbench" could be added to the postgresql 
todo wiki?


--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP Patch for GROUPING SETS phase 1

2014-09-09 Thread Robert Haas
On Tue, Sep 9, 2014 at 11:19 AM, Pavel Stehule pavel.steh...@gmail.com wrote:
 2014-09-09 16:01 GMT+02:00 Robert Haas robertmh...@gmail.com:
 On Thu, Aug 21, 2014 at 11:01 AM, Andrew Gierth
 and...@tao11.riddles.org.uk wrote:
  Heikki == Heikki Linnakangas hlinnakan...@vmware.com writes:
   Heikki Uh, that's ugly. The EXPLAIN output, I mean; as an implementation
   Heikki detail chaining the nodes might be reasonable. But the above
   Heikki gets unreadable if you have more than a few grouping sets.
 
  It's good for highlighting performance issues in EXPLAIN, too.

 Perhaps so, but that doesn't take away from Heikki's point: it's still
 ugly.  I don't understand why the sorts can't all be nested under the
 GroupAggregate nodes.  We have a number of nodes already (e.g. Append)
 that support an arbitrary number of children, and I don't see why we
 can't do the same thing here.

 I don't think showing sort and aggregation is a bad idea. Both can have
 different performance impacts

Sure, showing the sort and aggregation steps is fine.  But I don't see
what advantage we get out of showing them like this:

Aggregate
-> Sort
   -> ChainAggregate
      -> Sort
         -> ChainAggregate
            -> Sort

When we could show them like this:

Aggregate
-> Sort
-> Sort
-> Sort

From both a display perspective and an implementation-complexity
perspective, it seems appealing to have the Aggregate node feed the
data to one sort after another, rather than having it send the data down a
very deep pipe.

I might be missing something, of course.  I don't want to presume that
I'm smarter than Andrew, because Andrew is pretty smart.  :-)  But it
seems odd to me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP Patch for GROUPING SETS phase 1

2014-09-09 Thread Pavel Stehule
2014-09-09 16:01 GMT+02:00 Robert Haas robertmh...@gmail.com:

 On Thu, Aug 21, 2014 at 11:01 AM, Andrew Gierth
 and...@tao11.riddles.org.uk wrote:
  Heikki == Heikki Linnakangas hlinnakan...@vmware.com writes:
   Heikki Uh, that's ugly. The EXPLAIN output, I mean; as an implementation
   Heikki detail chaining the nodes might be reasonable. But the above
   Heikki gets unreadable if you have more than a few grouping sets.
 
  It's good for highlighting performance issues in EXPLAIN, too.

 Perhaps so, but that doesn't take away from Heikki's point: it's still
 ugly.  I don't understand why the sorts can't all be nested under the
 GroupAggregate nodes.  We have a number of nodes already (e.g. Append)
 that support an arbitrary number of children, and I don't see why we
 can't do the same thing here.


I don't think showing sort and aggregation is a bad idea. Both can have
different performance impacts





 --
 Robert Haas
 EnterpriseDB: http://www.enterprisedb.com
 The Enterprise PostgreSQL Company


 --
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers



[HACKERS] Problems with config.php and non default ports (postgresql - sugarcrm)

2014-09-09 Thread Bianca Santana Espichicoquez
Hello, I have a problem. We're using sugarcrm with a postgresql database
that is not on the default port. I put the port in the db_port parameter,
but it seems to be ignored: sugar keeps pointing to the other database
instance on port 5432, and I can see it in that instance's log when sugar
tries to connect, even after I change the port.

Postgresql 9.3
Sugarcrm 6.5.0

If I put ip:port in the db_host_name parameter, it does not work either. So
I need to know: is there another file where the port value needs to go?

I'll appreciate your help. Thanks.


Re: [HACKERS] proposal: plpgsql - Assert statement

2014-09-09 Thread Pavel Stehule
2014-09-09 7:55 GMT+02:00 Craig Ringer cr...@2ndquadrant.com:

 On 09/09/2014 05:20 AM, Robert Haas wrote:
 
  I previously proposed RAISE ASSERT ... WHERE, which seems like a nice
  variant of what we've already got, but perhaps this whole discussion
  merely illustrates that it's hard to get more than 1 vote for any
  proposal in this area.

 Well, you have Petr's for RAISE EXCEPTION ... WHEN, and I'd also like
 that or RAISE ASSERT ... WHEN.


Ada is a language with a strong character, and PLpgSQL is a slightly strange
fork of it, so it isn't easy to find a form that satisfies all requirements.

My requests:

* be consistent with current PLpgSQL syntax and logic
* allow some future extensibility
* allow static analysis without hard expression processing

But I am thinking there are some points on which there can be agreement,
although it is not an ASSERT implementation:

enhancing RAISE with WHEN - please see the attached patch.

I prefer RAISE ... WHEN over RAISE ... WHERE due to consistency with EXIT and
CONTINUE [ WHEN ];

Next, we can reserve some SQLCODE for assertions and implement it as an
unhandled exception. Right now it is only a cancel, and that is not usable.
It should probably be implemented at the SQL level, not only in plpgsql.
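
For illustration, usage under the proposed syntax would look roughly like
this (a hypothetical example; the trailing WHEN mirrors EXIT/CONTINUE ...
WHEN and is not committed PostgreSQL syntax):

DO $$
DECLARE
  balance numeric := -5;
BEGIN
  -- raises only when the condition holds
  RAISE EXCEPTION 'negative balance: %', balance WHEN balance < 0;
END $$;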

Regards

Pavel









 Much (much) saner than the other proposals on this thread IMO.

 --
  Craig Ringer   http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services

commit 77ff7203d83e889aa9f5190fe54b04dc7c26a5aa
Author: Pavel Stehule pavel.steh...@gooddata.com
Date:   Tue Sep 9 12:11:15 2014 +0200

initial

diff --git a/src/pl/plpgsql/src/pl_exec.c b/src/pl/plpgsql/src/pl_exec.c
index 11cb47b..f8ac200 100644
--- a/src/pl/plpgsql/src/pl_exec.c
+++ b/src/pl/plpgsql/src/pl_exec.c
@@ -2890,8 +2890,20 @@ exec_stmt_raise(PLpgSQL_execstate *estate, PLpgSQL_stmt_raise *stmt)
 	char	   *err_datatype = NULL;
 	char	   *err_table = NULL;
 	char	   *err_schema = NULL;
+	
 	ListCell   *lc;
 
+	if (stmt->cond != NULL)
+	{
+		bool	value;
+		bool	isnull;
+
+		value = exec_eval_boolean(estate, stmt->cond, &isnull);
+		exec_eval_cleanup(estate);
+		if (isnull || value == false)
+			return PLPGSQL_RC_OK;
+	}
+
 	/* RAISE with no parameters: re-throw current exception */
 	if (stmt->condname == NULL && stmt->message == NULL &&
 		stmt->options == NIL)
diff --git a/src/pl/plpgsql/src/pl_gram.y b/src/pl/plpgsql/src/pl_gram.y
index 893f3a4..1f0b861 100644
--- a/src/pl/plpgsql/src/pl_gram.y
+++ b/src/pl/plpgsql/src/pl_gram.y
@@ -63,6 +63,7 @@ static	void			current_token_is_not_variable(int tok);
 static	PLpgSQL_expr	*read_sql_construct(int until,
 			int until2,
 			int until3,
+			int until4,
 			const char *expected,
 			const char *sqlstart,
 			bool isexpression,
@@ -105,7 +106,7 @@ static	void			 check_labels(const char *start_label,
 	  int end_location);
 static	PLpgSQL_expr	*read_cursor_args(PLpgSQL_var *cursor,
 		  int until, const char *expected);
-static	List			*read_raise_options(void);
+static	List			*read_raise_options(int *endtok);
 static	void			check_raise_parameters(PLpgSQL_stmt_raise *stmt);
 
 %}
@@ -1422,6 +1423,7 @@ for_control		: for_variable K_IN
 			expr1 = read_sql_construct(DOT_DOT,
 	   K_LOOP,
 	   0,
+	   0,
 	   "LOOP",
 	   "SELECT ",
 	   true,
@@ -1716,6 +1718,7 @@ stmt_raise		: K_RAISE
 	{
 		PLpgSQL_stmt_raise		*new;
 		int	tok;
+		int	endtok;
 
 		new = palloc(sizeof(PLpgSQL_stmt_raise));
 
@@ -1726,6 +1729,7 @@ stmt_raise		: K_RAISE
 		new->message	= NULL;
 		new->params		= NIL;
 		new->options	= NIL;
+		new->cond = NULL;
 
 		tok = yylex();
 		if (tok == 0)
@@ -1796,22 +1800,22 @@ stmt_raise		: K_RAISE
  * or USING to begin the options list.
  */
 tok = yylex();
-				if (tok != ',' && tok != ';' && tok != K_USING)
+				if (tok != ',' && tok != ';' && tok != K_USING && tok != K_WHEN)
 	yyerror(syntax error);
 
 while (tok == ',')
 {
 	PLpgSQL_expr *expr;
 
-					expr = read_sql_construct(',', ';', K_USING,
-											  "\",\" or \";\" or USING",
+					expr = read_sql_construct(',', ';', K_USING, K_WHEN,
+											  "\",\" or \";\" or USING or WHEN",
 											  "SELECT ",
 											  true, true, true,
 											  NULL, &tok);
 	new-params = lappend(new-params, expr);
 }
 			}
-			else if (tok != K_USING)
+			else if (tok != K_USING && tok != K_WHEN)
 			{
 /* must be condition name or SQLSTATE */
 if (tok_is_keyword(tok, yylval,
@@ -1847,7 +1851,13 @@ stmt_raise		: K_RAISE
 			}
 
 			if (tok == K_USING)
-				new->options = read_raise_options();
+			{
+				new->options = read_raise_options(&endtok);
+				if (endtok == K_WHEN)
+					new->cond = read_sql_expression(';', ";");
+			}
+			if (tok == K_WHEN)
+				new->cond = 

Re: [HACKERS] [PATCH] empty xml values

2014-09-09 Thread Peter Eisentraut
On 8/30/14 12:43 PM, Ali Akbar wrote:
 While looking into this report
 http://www.postgresql.org/message-id/cf48ccfb.65a9d%tim.k...@gmail.com I
 noticed that we don't accept empty values as xml content values, even
 though this should apparently be allowed by the spec.  Attached is a
 patch to fix it (which needs updates to xml_1.out, which I omit here for
 simplicity).
 
 
 The patch works, although it must be applied with --ignore-whitespace.
 Attached is the patch + xml_1.out updates.
 
 I'm marking this ready for commit.

Committed, thanks.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] add modulo (%) operator to pgbench

2014-09-09 Thread Robert Haas
On Tue, Sep 9, 2014 at 11:07 AM, Fabien COELHO coe...@cri.ensmp.fr wrote:
 The fact that you're agonizing about which modulo to add is a sign that
 the language is too impoverished to let you do anything non-trivial.

 I'm not agonizing about which modulo to use :-) I know I do not want the
 C/SQL version, which is never really useful if you have signed data. I
 overlooked this detail when submitting the first patch, and produced a
 stupid patch with all 3 possible modulos, before settling on the sane solution,
 which is to use the floored-division (Knuth) version. If I had thought a bit,
 I would have sent v3 as v1 directly.

Sure, and I would have looked at that patch and complained that you
were implementing a modulo operator with different semantics than C.
And then we'd be right back where we are now.

The problem here isn't that I don't know what you want, or that what
you want is unreasonable.  The problem is that we can't cater to every
slightly-different thing that somebody wants to do with pgbench.  If
we try, we'll end up with a neverending jumble of features most of
which never get used by more than 1 or 2 people.  Features for any part of
PostgreSQL need to be of reasonably general interest, not something
that is super-specialized to one particular set of needs.  If I commit
what you're asking me to commit here, I think that's exactly what I'll
be doing, and I don't think that's a good idea.

In all seriousness, sometimes the right solution to these problems is
just to patch your own copy.  I've done that with pgbench at least
once and with PostgreSQL in general more times than I can conveniently
count.  I've learned a lot of useful things that way, but I can't
expect all of the debugging instrumentation and trial features that
I've created to get incorporated into the product.  It's not
reasonable, and it's not worth the time it would take me to make the
code general and maintainable, so the code just gets dropped on the
floor.  In a perfect world, that wouldn't happen, but in a perfect
world they'd pay me the same salary they pay Linus Torvalds.

In this particular instance, we got onto this topic in the first place
because of the Gaussian and exponential patches, and the desire to
have the hot portion of the keys distributed in some random way
through the data set rather than having everything be clustered at the
front.  As you yourself said, the best case for this patch is to allow
a poor-man's approximation of that.  Adding a weird non-standard
operator for a poor approximation of the feature we're really looking
for just doesn't feel like the right call.  I recognize that you have
a different view, and if enough people agree with you, that argument
may win the day.  But my opinion is that we are too far down in the
weeds.  We should be trying to create features that will have general
appeal to pgbench users, not features that solve narrow problems for
individual developers.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP Patch for GROUPING SETS phase 1

2014-09-09 Thread Andrew Gierth
 Robert == Robert Haas robertmh...@gmail.com writes:

 Robert Sure, showing the sort and aggregation steps is fine.  But I
 Robert don't see what advantage we get out of showing them like
 Robert this:

 Robert Aggregate
 Robert -> Sort
 Robert    -> ChainAggregate
 Robert       -> Sort
 Robert          -> ChainAggregate
 Robert             -> Sort

The advantage is that this is how the plan tree is actually
structured.

 Robert When we could show them like this:

 Robert Aggregate
 Robert -> Sort
 Robert -> Sort
 Robert -> Sort

And we can't structure the plan tree like this, because then it
wouldn't be a _tree_ any more.

The Sort node expects to have a child node to fetch rows from, and it
expects all the usual plan tree mechanics (initialization, rescan,
etc.) to work on that child node. There's no way for the parent to
feed data to the child.

 Robert From both a display perspective and an
 Robert implementation-complexity perspective,

... says the person who has never tried implementing it.

Honestly, ChainAggregate is _trivial_ compared to trying to make the
GroupAggregate code deal with multiple inputs, or trying to make some
new sort of plumbing node to feed input to those sorts.  (You'd think
that it should be possible to use the existing CTE mechanics to do it,
but n... the existing code is actively and ferociously hostile to
the idea of adding new CTEs from within the planner.)

-- 
Andrew (irc:RhodiumToad)


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Problems with config.php and non default ports (postgresql - sugarcrm)

2014-09-09 Thread Andrew Dunstan


On 09/08/2014 05:27 PM, Bianca Santana Espichicoquez wrote:
Hello, I have a problem. We're using sugarcrm with a postgresql database 
that is not on the default port. I put the port in the db_port parameter, 
but it seems to be ignored: sugar keeps pointing to the other database 
instance on port 5432, and I can see it in that instance's log when sugar 
tries to connect, even after I change the port.


Postgresql 9.3
Sugarcrm 6.5.0

If I put ip:port in the db_host_name parameter, it does not work either. So 
I need to know: is there another file where the port value needs to go?


I'll appreciate your help. Thanks.



Please post on the correct list. Neither pgsql-advocacy nor 
pgsql-hackers is the correct list. Either a Sugarcrm list or 
pgsql-general might be the correct forum.


Also, cross-posting on more than one postgresql list is generally 
frowned on.


cheers

andrew


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PQputCopyEnd doesn't adhere to its API contract

2014-09-09 Thread Robert Haas
On Mon, Sep 8, 2014 at 6:20 PM, David G Johnston
david.g.johns...@gmail.com wrote:
 On Mon, Sep 8, 2014 at 11:45 AM, Robert Haas [via PostgreSQL] [hidden
 email] wrote:

 On Thu, Sep 4, 2014 at 6:38 PM, David Johnston
 [hidden email] wrote:

  One of the trade-offs I mentioned... it's more style than anything, but
  removing the parenthetical (if there is no error in the command) and
  writing it more directly seemed preferable in an overview such as this.
 
  Better:  The function will either throw an error or return a PGresult
  object[...]

 Nope.  This is not C++, nor is it the backend.  It will not throw
 anything.

 The current sentence reads:
 The response to this (if there is no error in the command) will be a
 PGresult object bearing a status code of PGRES_COPY_OUT or PGRES_COPY_IN
 (depending on the specified copy direction).

 Maybe this is taken for granted by others, and thus can be excluded here,
 but I'm trying to specify what happens if there is an error in the
 command...  Apparently one does not get back a PGresult object...

Well, there's the point of confusion, because the function ALWAYS
returns a PGresult, except maybe in some obscure corner case where it
can't allocate one and therefore returns NULL.  What differs is the
result status.

 Ignoring style and such, did anything I wrote help to clarify your
 understanding of why PQputCopyEnd does what it does?  As I say this and
 start to try and read the C code, I think that it is a good source for
 understanding novice assumptions, but there is a gap between the docs and the
 code - though I'm not sure you've identified the only (or even proper)
 location.

Honestly, not really.  I thought the language I previously discussed
with Tom was adequate for that; and I'm a little confused as to why
we've gotten on what seems to be somewhat of a digression into nearby
but otherwise-unrelated documentation issues.

 Unlike PQputCopyData there is no explicit logic (pqIsnonblocking(conn)) in
 PQputCopyEnd for non-blocking mode (there is an implicit one via pqFlush).
 This does seem like an oversight - a minor one, since the likelihood of not
 being able to add the EOF to the connection's buffer seems very small -
 but I would expect, on the basis of symmetry alone, for both of them to have
 buffer-full testing logic.  And, depending on how large *errormsg is, the
 risk of being unable to place the data in the buffer isn't as small as in the
 expected EOF case.

Yeah, I think that's a bug in PQputCopyEnd().  It should have the same
kind of pqCheckOutBufferSpace() check that PQputCopyData() does.

 I'm getting way beyond my knowledge level here but my assumption from the
 documentation was that the async mode behavior of returning zero revolved
 around retrying in order to place the data into the buffer - not retrying to
 make sure the buffer is empty.  The caller deals with that on their own
 based upon their blocking mode decision.  Though we would want to call
 pqFlush for blocking mode callers here since this function should block
 until the buffer is clear.

PQputCopyEnd() already does flush; and PQputCopyData() shouldn't flush.

 So, I thought I agreed with your suggestion that if the final pqFlush
 returns a 1 that pqPutCopyEnd should return 0.

Well, Tom set me straight on that; so I don't think we're considering
any such change.  I think you need to make sure you understand the
previous discussion in detail before proposing how to adjust the
documentation (or the code).

 Additionally, if in
 non-blocking mode, and especially if *errormsg is not NULL, the available
 connection buffer length logic in pqPutCopyData should be evaluated here as
 well.

Yes.  See the comment three up from this one.

 The most useful and compatible solution is to make pqPutCopyEnd synchronous
 regardless of the user-selected blocking mode - which it mostly is now, but
 the final pqFlush should be in a loop - and adjust the documentation in the
 areas noted in my patch email to accommodate that fact.

Uh, no.  If you do that, you'll break the code that I spent so much
time writing, which led to my original post.  I wouldn't be using
non-blocking mode if it were OK for it to block.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP Patch for GROUPING SETS phase 1

2014-09-09 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Tue, Sep 9, 2014 at 12:01 PM, Andrew Gierth
 and...@tao11.riddles.org.uk wrote:
 Honestly, ChainAggregate is _trivial_ compared to trying to make the
 GroupAggregate code deal with multiple inputs, or trying to make some
 new sort of plumbing node to feed input to those sorts.  (You'd think
 that it should be possible to use the existing CTE mechanics to do it,
 but n... the existing code is actively and ferociously hostile to
 the idea of adding new CTEs from within the planner.)

 That's unfortunate.

I'm less than convinced that it's true ... I've been meaning to find
time to review this patch, but it sounds like it's getting to the point
where I need to.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [Fwd: Re: [HACKERS] proposal: new long psql parameter --on-error-stop]

2014-09-09 Thread Andres Freund
On 2014-08-28 13:54:28 +0200, Andres Freund wrote:
 On 2014-08-28 13:20:07 +0200, Andres Freund wrote:
  I've attached a incremental patch.
 
 Apparently I didn't attach the patch so much as a file containing the
 name of the patchfile...

Which you obviously didn't integrate. And didn't comment on. Grr.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP Patch for GROUPING SETS phase 1

2014-09-09 Thread Robert Haas
On Tue, Sep 9, 2014 at 12:01 PM, Andrew Gierth
and...@tao11.riddles.org.uk wrote:
 Robert == Robert Haas robertmh...@gmail.com writes:

  Robert Sure, showing the sort and aggregation steps is fine.  But I
  Robert don't see what advantage we get out of showing them like
  Robert this:

  Robert Aggregate
  Robert -> Sort
  Robert    -> ChainAggregate
  Robert       -> Sort
  Robert          -> ChainAggregate
  Robert             -> Sort

 The advantage is that this is how the plan tree is actually
 structured.

I do understand that.  I am questioning (as I believe Heikki was also)
whether it's structured correctly.  Nobody is arguing for displaying
the plan tree in a way that doesn't mirror it's actual structure, or
at least I am not.

 The Sort node expects to have a child node to fetch rows from, and it
 expects all the usual plan tree mechanics (initialization, rescan,
 etc.) to work on that child node. There's no way for the parent to
 feed data to the child.

OK, good point.  So we do need something that can feed data from one
part of the plan tree to another, like a CTE does.  I still think it
would be worth trying to see if there's a reasonable way to structure
the plan tree so that it's flatter.

  Robert From both a display perspective and an
  Robert implementation-complexity perspective,

 ... says the person who has never tried implementing it.

This comment to me reads rather sharply, and I don't feel that the
tone of my email is such as to merit a rebuke.

 Honestly, ChainAggregate is _trivial_ compared to trying to make the
 GroupAggregate code deal with multiple inputs, or trying to make some
 new sort of plumbing node to feed input to those sorts.  (You'd think
 that it should be possible to use the existing CTE mechanics to do it,
 but n... the existing code is actively and ferociously hostile to
 the idea of adding new CTEs from within the planner.)

That's unfortunate.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] LIMIT for UPDATE and DELETE

2014-09-09 Thread Jeff Janes
On Tue, Sep 9, 2014 at 2:57 AM, Heikki Linnakangas hlinnakan...@vmware.com
wrote:


 I agree. If we are to support UPDATE .. ORDER BY .. LIMIT, it should work
 with inheritance. So the path forward is (using Marko's phrasing upthread):

1) We put the LIMIT inside ModifyTable like this patch does.  This
 doesn't prevent us from doing ORDER BY in the future, but helps numerous
 people who today have to
2) Someone rewrites how UPDATE works based on Tom's suggestion here:
 http://www.postgresql.org/message-id/1598.1399826...@sss.pgh.pa.us,
  which allows us to support ORDER BY on all tables (or perhaps not
  FDWs, I don't know how those work).  The LIMIT functionality in this
 patch is unaffected.

  What's not clear to me is whether it makes sense to do 1) without 2). Is
  UPDATE .. LIMIT without support for an ORDER BY useful enough?


I've wanted LIMIT even without ORDER BY many times, so I'd vote yes.



  And if we apply this patch now, how much of it needs to be rewritten after
  2) ? If the answers are "yes" and "not much", then we should review this
  patch now, and put 2) on the TODO list. Otherwise 2) should be done first.


On that I can't give any useful feedback.


Cheers,

Jeff


Re: [HACKERS] pg_background (and more parallelism infrastructure patches)

2014-09-09 Thread Petr Jelinek

On 29/07/14 18:51, Robert Haas wrote:

On Mon, Jul 28, 2014 at 1:50 PM, Andres Freund and...@2ndquadrant.com wrote:

What I'm thinking of is providing an actual API for the writes instead
of hooking into the socket API in a couple places. I.e. have something
like

typedef struct DestIO DestIO;

struct DestIO
{
 void (*flush)(struct DestIO *io);
 int (*putbytes)(struct DestIO *io, const char *s, size_t len);
 int (*getbytes)(struct DestIO *io, const char *s, size_t len);
 ...
}

and do everything through it. I haven't thought much about the specific
API we want, but abstracting the communication properly instead of
adding hooks here and there is imo much more likely to succeed in the
long term.


This sounds suspiciously like the DestReceiver thing we've already
got, except that the DestReceiver only applies to tuple results, not
errors and notices and so on.  I'm not totally unamenable to a bigger
refactoring here, but right now it looks to me like a solution in
search of a problem.  The hooks are simple and seem to work fine; I
don't want to add notation for its own sake.



I am not sure if what Andres proposed is the right solution, but I do 
agree here that the hook definitely isn't.


I am also not sure that I like pq_redirect_to_shm_mq being directly 
exposed for use in a bgworker. What I would like is an elog interface 
which you tell that you want errors sent to a shm_mq handle, and which 
would then do the necessary calls (it's basically the same principle, 
but for the sake of cleanliness/correctness I think it should be elog 
and not pq from the bgworker's point of view).


So the interface needs work.

I do agree that we don't need two-way communication over this channel; I 
think the dsm_mq can be used directly quite well for the generic use case.


Otherwise I like the patch, and I am glad you also made the GUC 
save/restore infrastructure.


Please don't forget to add this to next commitfest.


--
 Petr Jelinek  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP Patch for GROUPING SETS phase 1

2014-09-09 Thread Andrew Gierth
 Tom == Tom Lane t...@sss.pgh.pa.us writes:

  Honestly, ChainAggregate is _trivial_ compared to trying to make the
  GroupAggregate code deal with multiple inputs, or trying to make some
  new sort of plumbing node to feed input to those sorts.  (You'd think
  that it should be possible to use the existing CTE mechanics to do it,
  but n... the existing code is actively and ferociously hostile to
  the idea of adding new CTEs from within the planner.)

  That's unfortunate.

 Tom I'm less than convinced that it's true ...

Maybe you can figure out how, but I certainly didn't see a reasonable way.

I would also question one aspect of the desirability - using the CTE
mechanism has the downside of needing an extra tuplestore with the
full input data set in it, whereas the chain mechanism only has
aggregated data in its tuplestore, which should be much smaller.

-- 
Andrew (irc:RhodiumToad)


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Add shutdown_at_recovery_target option to recovery.conf

2014-09-09 Thread Petr Jelinek

Hi,

I have recently wanted, several times, to have a slave server prepared at a 
certain point in time, to reduce the time it takes for it to replay the 
remaining WAL (say I ran pg_basebackup -x on a busy db, for example).
Currently the way to do it is to have pause_at_recovery_target set to true 
(the default), wait until pg accepts connections, and then shut it down. The 
issue is that this is ugly, and also there is a chance that somebody 
else connects and does bad things (tm) before my process does.


So I wrote a simple patch that adds an option to shut down the cluster once 
the recovery target is reached. The server will still be able to continue 
WAL replay if needed later, or can be configured to start standalone.
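
A minimal recovery.conf sketch using the new option (the last parameter is
the one this patch adds; the other values are illustrative):

restore_command = 'cp /mnt/archive/%f "%p"'
recovery_target_time = '2014-09-09 12:00:00'
shutdown_at_recovery_target = true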



--
 Petr Jelinek  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
diff --git a/doc/src/sgml/recovery-config.sgml b/doc/src/sgml/recovery-config.sgml
index 0f1ff34..278bbaa 100644
--- a/doc/src/sgml/recovery-config.sgml
+++ b/doc/src/sgml/recovery-config.sgml
@@ -309,6 +309,36 @@ restore_command = 'copy C:\\server\\archivedir\\%f %p'  # Windows
  </varlistentry>
 
  </variablelist>
+
+ <varlistentry id="shutdown-at-recovery-target"
+   xreflabel="shutdown_at_recovery_target">
+  <term><varname>shutdown_at_recovery_target</varname> (<type>boolean</type>)
+  <indexterm>
+    <primary><varname>shutdown_at_recovery_target</varname> recovery parameter</primary>
+  </indexterm>
+  </term>
+  <listitem>
+   <para>
+    Specifies whether postgres should shut down once the recovery target
+    is reached. The default is false.
+    This is intended to have the instance ready at the exact replay point
+    desired. The instance will still be able to replay more WAL records.
+    Note that because recovery.conf will not be removed if this option is
+    set to true, any subsequent start will end with immediate shutdown
+    unless you change your configuration.
+   </para>
+   <para>
+    This setting has no effect if <xref linkend="guc-hot-standby"> is not
+    enabled, or if no recovery target is set.
+   </para>
+   <para>
+    Setting this to true will set <link linkend="pause-at-recovery-target">
+    <varname>pause_at_recovery_target</varname></link> to false.
+   </para>
+  </listitem>
+ </varlistentry>
+
+ </variablelist>
 </sect1>
 
   <sect1 id="standby-settings">
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 34f2fc0..bbe3987 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -222,6 +222,7 @@ static char *archiveCleanupCommand = NULL;
 static RecoveryTargetType recoveryTarget = RECOVERY_TARGET_UNSET;
 static bool recoveryTargetInclusive = true;
 static bool recoveryPauseAtTarget = true;
+static bool recoveryShutdownAtTarget = false;
 static TransactionId recoveryTargetXid;
 static TimestampTz recoveryTargetTime;
 static char *recoveryTargetName;
@@ -5159,6 +5160,18 @@ readRecoveryCommandFile(void)
 									(errmsg_internal("pause_at_recovery_target = '%s'",
 													 item->value)));
 		}
+		else if (strcmp(item->name, "shutdown_at_recovery_target") == 0)
+		{
+			if (!parse_bool(item->value, &recoveryShutdownAtTarget))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("parameter \"%s\" requires a Boolean value", "shutdown_at_recovery_target")));
+			ereport(DEBUG2,
+					(errmsg_internal("shutdown_at_recovery_target = '%s'",
+									 item->value)));
+			if (recoveryShutdownAtTarget)
+				recoveryPauseAtTarget = false;
+		}
 		else if (strcmp(item->name, "recovery_target_timeline") == 0)
 		{
 			rtliGiven = true;
@@ -6907,6 +6920,24 @@ StartupXLOG(void)
 				ereport(LOG,
 						(errmsg("last completed transaction was at log time %s",
 								timestamptz_to_str(xtime))));
+
+			/*
+			 * Shutdown here when requested. We need to exit this early because
+			 * we want to be able to continue the WAL replay when started
+			 * the next time.
+			 */
+			if (recoveryShutdownAtTarget && reachedStopPoint)
+			{
+				ereport(LOG, (errmsg("shutting down")));
+				/*
+				 * Note that we exit with status 2 instead of 0 here to force
+				 * postmaster to shut down the whole instance. This produces one
+				 * unnecessary log line, but we don't have a better way to
+				 * shut down postmaster from within a single backend.
+				 */
+				proc_exit(2);
+			}
+
 			InRedo = false;
 		}
 		else

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PQputCopyEnd doesn't adhere to its API contract

2014-09-09 Thread David Johnston
On Tue, Sep 9, 2014 at 12:03 PM, Robert Haas robertmh...@gmail.com wrote:

 On Mon, Sep 8, 2014 at 6:20 PM, David G Johnston
 david.g.johns...@gmail.com wrote:
  On Mon, Sep 8, 2014 at 11:45 AM, Robert Haas [via PostgreSQL] [hidden
  email] wrote:
 
  On Thu, Sep 4, 2014 at 6:38 PM, David Johnston
  [hidden email] wrote:
 


  Ignoring style and such, did anything I wrote help to clarify your
  understanding of why PQputCopyEnd does what it does?  As I say this and
  start to try and read the C code, I think that it is a good source for
  understanding novice assumptions, but there is a gap between the docs and
  the code - though I'm not sure you've identified the only (or even proper)
  location.

 Honestly, not really.  I thought the language I previously discussed
 with Tom was adequate for that; and I'm a little confused as to why
 we've gotten on what seems to be somewhat of a digression into nearby
 but otherwise-unrelated documentation issues.


My theory here is that if you discover a single point of confusion that
cannot be attributed to a typo (or something basic like that), then why not
go looking around and see if there are any other issues in nearby
code/documentation.  Furthermore, not being the writer of said
code/documentation, a fresh perspective on the entire body of work - and
not just the few lines you quoted - has value.  For me, if nothing else, I
took it as a chance to learn and evaluate the status quo.  And as noted
below there are a couple of items to address in the nearby code even if the
documentation itself is accurate - I just took a very indirect route to
discovering them.

Much of the current documentation is copy-and-pasted directly from the
source code - with a few comments tacked on for clarity.  I'm sure some
layout and style concerns were present during the copy-and-paste but the
user-facing documentation does not, and probably should not ideally,
directly emulate the source code comments.  Admittedly, in the libpq
section this can be considerably relaxed since the target user is likely a
C programmer.



  Unlike PQputCopyData there is no explicit logic (pqIsnonblocking(conn)) in
  PQputCopyEnd for non-blocking mode (there is an implicit one via pqFlush).
  This does seem like an oversight - a minor one, since the likelihood of not
  being able to add the EOF to the connection's buffer seems very small -
  but I would expect, on the basis of symmetry alone, for both of them to
  have buffer-full testing logic.  And, depending on how large *errormsg is,
  the risk of being unable to place the data in the buffer isn't as small as
  in the expected EOF case.

 Yeah, I think that's a bug in PQputCopyEnd().  It should have the same
 kind of pqCheckOutBufferSpace() check that PQputCopyData() does.


Anybody disagree here?



  I'm getting way beyond my knowledge level here but my assumption from the
  documentation was that the async mode behavior of returning zero revolved
  around retrying in order to place the data into the buffer - not retrying to
  make sure the buffer is empty.  The caller deals with that on their own
  based upon their blocking mode decision.  Though we would want to call
  pqFlush for blocking mode callers here since this function should block
  until the buffer is clear.

 PQputCopyEnd() already does flush; and PQputCopyData() shouldn't flush.


My point being that pqFlush is sensitive to whether the caller is in blocking
mode.

Also, PQputCopyData DOES flush if the buffer is full... then attempts to
resize the buffer... then returns 0 (or -1 in blocking mode).  How absolute
is Tom's "Well, we certainly do NOT want a flush in PQputCopyData"?


  So, I thought I agreed with your suggestion that if the final pqFlush
  returns a 1 that pqPutCopyEnd should return 0.

 Well, Tom set me straight on that; so I don't think we're considering
 any such change.  I think you need to make sure you understand the
 previous discussion in detail before proposing how to adjust the
 documentation (or the code).


That was largely what this exercise was about for me... having gone through
all this and now re-reading Tom's wording, I do understand and agree.

For the surrounding documentation I think we are good, since PQputCopyData
already tests the buffer and correctly returns 0 in non-blocking mode.  The
issue with PQputCopyEnd seems to be a copy/paste error and an incorrect
assumption regarding non-blocking mode.


This doesn't address the usage bug we've been propagating, where someone
very well might use non-blocking mode and, upon seeing a return value of 1
from PQputCopyEnd, immediately call PQgetResult even if the buffer is not
completely flushed.  Our proposed documentation directs people to always
poll pqFlush after calling PQputCopyEnd, while the current documentation is
not so strict.  Is there anything to say/do about that fact or - more
importantly - is there any real risk here?
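
For concreteness, a sketch of that flush-then-poll pattern (PQputCopyEnd,
PQflush, PQsocket and PQgetResult are real libpq calls; the function and
its error handling are illustrative, and PQputCopyEnd's 0-return retry case
is glossed over):

#include <sys/select.h>
#include <libpq-fe.h>

/* after streaming COPY data in non-blocking mode: queue the end-of-copy
 * marker, then drain the output buffer before asking for a result */
static int
finish_copy(PGconn *conn)
{
    if (PQputCopyEnd(conn, NULL) != 1)
        return -1;              /* simplified: a 0 return would need a retry */

    for (;;)
    {
        int r = PQflush(conn);  /* 0 = flushed, 1 = more to send, -1 = error */

        if (r == 0)
            break;
        if (r < 0)
            return -1;

        /* wait until the socket is writable, then flush again */
        fd_set wfds;
        FD_ZERO(&wfds);
        FD_SET(PQsocket(conn), &wfds);
        if (select(PQsocket(conn) + 1, NULL, &wfds, NULL, NULL) < 0)
            return -1;
    }

    /* only now is it safe to move on to PQgetResult() */
    return 0;
}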

David J.


Re: [HACKERS] pg_background (and more parallelism infrastructure patches)

2014-09-09 Thread Robert Haas
On Tue, Sep 9, 2014 at 12:33 PM, Petr Jelinek p...@2ndquadrant.com wrote:
 I am not sure if what Andres proposed is the right solution, but I do agree
 here that the hook definitely isn't.

Well, that doesn't get me very far.  Andres never responded to my
previous comments about why I did it that way, and you're not
proposing a clear alternative that I can either implement or comment
on.

 I am also not sure that I like pq_redirect_to_shm_mq being directly
 exposed for use in a bgworker. What I would like is an elog interface which
 you tell that you want errors sent to a shm_mq handle, and which would
 then do the necessary calls (it's basically the same principle, but for the
 sake of cleanliness/correctness I think it should be elog and not pq from
 the bgworker's point of view).

I think that's completely wrong.  As the patch series demonstrates,
it's not limited to propagating ErrorResponse and NoticeResponse.  It
can also propagate NotifyResponse and RowDescription and DataRow and
anything else that comes along.  We are not just propagating errors;
we are propagating all protocol messages of whatever type.  So tying
it to elog specifically is not right.

 So the interface needs work.

 I do agree that we don't need two way communication over this channel, I
 think the dsm_mq can be used directly quite well for generic usecase.

I'm glad you agree, but that leaves me baffled as to what's wrong with
the hook approach.  There's no crying need for extensibility in this
code; it's barely changed in a very long time, and I know of no other
need which might soon require changing it again.  The hook is very
lightweight and shouldn't materially affect performance when it's not
used, as a more complex approach might.
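
To illustrate the shape of the hook approach being defended here -- a bare
function pointer consulted on the normal path (all names below are invented
for illustration, not the patch's actual identifiers):

typedef void (*comm_hook_type) (const char *data, size_t len);

/* NULL by default; a bgworker can point this at a shm_mq-based sender */
comm_hook_type comm_hook = NULL;

static void
send_message(const char *data, size_t len)
{
    if (comm_hook != NULL)
        comm_hook(data, len);       /* redirected, e.g. into a shm_mq */
    else
        standard_send(data, len);   /* hypothetical normal socket path */
}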

 Otherwise I like the patch, and I am glad you also made the GUC save/restore
 infrastructure.

Cool.

 Please don't forget to add this to next commitfest.

OK, done.  But it would be awfully nice if we could actually make some
progress on hammering out a design everyone can live with here.
Letting it sit for another 5 weeks is not going to help anything, and
it will hold up getting any more stuff after this done in time for
9.5.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tracking commit timestamps

2014-09-09 Thread Petr Jelinek

Hi,

I worked a bit on this patch to make it closer to a committable state.

There are several bugs fixed, including the ones mentioned by Jamie (writing 
WAL during recovery).


Also support for pg_resetxlog/pg_upgrade has been implemented by Andres.

I added a simple regression test and a regression contrib module to cover 
both the off and on settings.


The SLRU issue Heikki mentioned should also be gone, mainly thanks to 
638cf09e7 (I did test it too).


One notable thing missing is documentation for the three SQL-level 
interfaces provided; I plan to add that soon.


--
 Petr Jelinek  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
diff --git a/contrib/pg_upgrade/pg_upgrade.c b/contrib/pg_upgrade/pg_upgrade.c
index 3b8241b..f0a023f 100644
--- a/contrib/pg_upgrade/pg_upgrade.c
+++ b/contrib/pg_upgrade/pg_upgrade.c
@@ -423,8 +423,10 @@ copy_clog_xlog_xid(void)
 	/* set the next transaction id and epoch of the new cluster */
 	prep_status(Setting next transaction ID and epoch for new cluster);
 	exec_prog(UTILITY_LOG_FILE, NULL, true,
-			  "\"%s/pg_resetxlog\" -f -x %u \"%s\"",
-			  new_cluster.bindir, old_cluster.controldata.chkpnt_nxtxid,
+			  "\"%s/pg_resetxlog\" -f -x %u -c %u \"%s\"",
+			  new_cluster.bindir,
+			  old_cluster.controldata.chkpnt_nxtxid,
+			  old_cluster.controldata.chkpnt_nxtxid,
 			  new_cluster.pgdata);
 	exec_prog(UTILITY_LOG_FILE, NULL, true,
 			  "\"%s/pg_resetxlog\" -f -e %u \"%s\"",
diff --git a/contrib/pg_xlogdump/rmgrdesc.c b/contrib/pg_xlogdump/rmgrdesc.c
index cbcaaa6..81cbcaf 100644
--- a/contrib/pg_xlogdump/rmgrdesc.c
+++ b/contrib/pg_xlogdump/rmgrdesc.c
@@ -9,6 +9,7 @@
 #include postgres.h
 
 #include access/clog.h
+#include access/committs.h
 #include access/gin.h
 #include access/gist_private.h
 #include access/hash.h
diff --git a/contrib/test_committs/.gitignore b/contrib/test_committs/.gitignore
new file mode 100644
index 000..1f95503
--- /dev/null
+++ b/contrib/test_committs/.gitignore
@@ -0,0 +1,5 @@
+# Generated subdirectories
+/log/
+/isolation_output/
+/regression_output/
+/tmp_check/
diff --git a/contrib/test_committs/Makefile b/contrib/test_committs/Makefile
new file mode 100644
index 000..2240749
--- /dev/null
+++ b/contrib/test_committs/Makefile
@@ -0,0 +1,45 @@
+# Note: because we don't tell the Makefile there are any regression tests,
+# we have to clean those result files explicitly
+EXTRA_CLEAN = $(pg_regress_clean_files) ./regression_output ./isolation_output
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/test_committs
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
+
+# We can't support installcheck because normally installcheck users don't have
+# the required track_commit_timestamp on
+installcheck:;
+
+check: regresscheck
+
+submake-regress:
+	$(MAKE) -C $(top_builddir)/src/test/regress all
+
+submake-test_committs:
+	$(MAKE) -C $(top_builddir)/contrib/test_committs
+
+REGRESSCHECKS=committs_on
+
+regresscheck: all | submake-regress submake-test_committs
+	$(MKDIR_P) regression_output
+	$(pg_regress_check) \
+	--temp-config $(top_srcdir)/contrib/test_committs/committs.conf \
+	--temp-install=./tmp_check \
+	--extra-install=contrib/test_committs \
+	--outputdir=./regression_output \
+	$(REGRESSCHECKS)
+
+regresscheck-install-force: | submake-regress submake-test_committs
+	$(pg_regress_installcheck) \
+	--extra-install=contrib/test_committs \
+	$(REGRESSCHECKS)
+
+PHONY: submake-test_committs submake-regress check \
+	regresscheck regresscheck-install-force
\ No newline at end of file
diff --git a/contrib/test_committs/committs.conf b/contrib/test_committs/committs.conf
new file mode 100644
index 000..d221a60
--- /dev/null
+++ b/contrib/test_committs/committs.conf
@@ -0,0 +1 @@
+track_commit_timestamp = on
\ No newline at end of file
diff --git a/contrib/test_committs/expected/committs_on.out b/contrib/test_committs/expected/committs_on.out
new file mode 100644
index 000..9920343
--- /dev/null
+++ b/contrib/test_committs/expected/committs_on.out
@@ -0,0 +1,21 @@
+--
+-- Commit Timestamp (on)
+--
+CREATE TABLE committs_test(id serial, ts timestamptz default now());
+INSERT INTO committs_test DEFAULT VALUES;
+INSERT INTO committs_test DEFAULT VALUES;
+INSERT INTO committs_test DEFAULT VALUES;
+SELECT id, pg_get_transaction_extradata(xmin),
+   pg_get_transaction_committime(xmin) >= ts,
+   pg_get_transaction_committime(xmin) < now(),
+   pg_get_transaction_committime(xmin) - ts < '60s' -- 60s should give a lot of reserve
+FROM committs_test
+ORDER BY id;
+ id | pg_get_transaction_extradata | ?column? | ?column? | ?column? 
++--+--+--+--
+  1 |                            0 | t        | t        | t
+  2 |                            0 | t        | t        | t
+  3 |  

Re: [HACKERS] pg_background (and more parallelism infrastructure patches)

2014-09-09 Thread Petr Jelinek

On 09/09/14 18:49, Robert Haas wrote:

On Tue, Sep 9, 2014 at 12:33 PM, Petr Jelinek p...@2ndquadrant.com wrote:

I am also not sure that I like the pq_redirect_to_shm_mq being directly
exposed for use in bgworker, what I would like is to have elog interface to
which you tell that you want errors sent to shm_mq handle and that one would
then do the necessary calls (It's basically same principle but for the sake
of cleanliness/correctness I think it should be elog and not pq from the
point of bgworker).


I think that's completely wrong.  As the patch series demonstrates,
it's not limited to propagating ErrorResponse and NoticeResponse.  It
can also propagate NotifyResponse and RowDescription and DataRow and
anything else that comes along.  We are not just propagating errors;
we are propagating all protocol messages of whatever type.  So tying
it to elog specifically is not right.



Oh, in that case I think what Andres proposed is actually quite good. I 
know the hook works fine; it just seems like a somewhat hackish 
solution to save 20 lines of code.
The reason why I am not proposing anything better is that my solution, 
when I did prototyping in this area, was to just check if my pq_dsm_mq 
handle is set or not and call the appropriate function from the wrapper 
based on that, which does not seem like a big improvement over the hook 
approach... Anything better/more flexible that I could come up with is 
basically the same idea Andres already wrote.



Please don't forget to add this to next commitfest.


OK, done.  But it would be awfully nice if we could actually make some
progress on hammering out a design everyone can live with here.
Letting it sit for another 5 weeks is not going to help anything, and
it will hold up getting any more stuff after this done in time for
9.5.



Yeah, I said that just as a formality; I wrote the response now and not in 
5 weeks exactly for this reason. I am all for discussing this now, and I 
think it's important to hammer this infrastructure out sooner rather 
than later.


--
 Petr Jelinek  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Spinlocks and compiler/memory barriers

2014-09-09 Thread Robert Haas
On Mon, Sep 8, 2014 at 7:10 PM, Andres Freund and...@2ndquadrant.com wrote:
 This has been pending for almost two months now and, at your request,
 my patch to make spinlocks act as compiler barriers is waiting behind
 it.  Can we please get this moving again soon, or can I commit that
 patch and you can fix this when you get around to it?

 I finally pushed this. And once more I seriously got pissed at the poor
 overall worldwide state of documentation and continuously changing
 terminology around this.

There does seem to be a deficit in that area.

 Sorry for taking this long :(

 Do you have a current version of your patch to make them compiler
 barriers?

I had forgotten that it needed an update.  Thanks for the reminder.  Here's v2.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
diff --git a/src/backend/storage/lmgr/s_lock.c b/src/backend/storage/lmgr/s_lock.c
index efe1b43..38dc34d 100644
--- a/src/backend/storage/lmgr/s_lock.c
+++ b/src/backend/storage/lmgr/s_lock.c
@@ -154,6 +154,17 @@ s_lock(volatile slock_t *lock, const char *file, int line)
 	return delays;
 }
 
+#ifdef USE_DEFAULT_S_UNLOCK
+void
+s_unlock(slock_t *lock)
+{
+#ifdef TAS_ACTIVE_WORD
+	*TAS_ACTIVE_WORD(lock) = -1;
+#else
+	*lock = 0;
+#endif
+}
+#endif
 
 /*
  * Set local copy of spins_per_delay during backend startup.
diff --git a/src/include/storage/s_lock.h b/src/include/storage/s_lock.h
index 06dc963..8e0c4c3 100644
--- a/src/include/storage/s_lock.h
+++ b/src/include/storage/s_lock.h
@@ -55,14 +55,16 @@
  *	on Alpha TAS() will fail if interrupted.  Therefore a retry loop must
  *	always be used, even if you are certain the lock is free.
  *
- *	Another caution for users of these macros is that it is the caller's
- *	responsibility to ensure that the compiler doesn't re-order accesses
- *	to shared memory to precede the actual lock acquisition, or follow the
- *	lock release.  Typically we handle this by using volatile-qualified
- *	pointers to refer to both the spinlock itself and the shared data
- *	structure being accessed within the spinlocked critical section.
- *	That fixes it because compilers are not allowed to re-order accesses
- *	to volatile objects relative to other such accesses.
+ *	It is the responsibility of these macros to make sure that the compiler
+ *	does not re-order accesses to shared memory to precede the actual lock
+ *	acquisition, or follow the lock release.  Prior to PostgreSQL 9.5, this
+ *	was the caller's responsibility, which meant that callers had to use
+ *	volatile-qualified pointers to refer to both the spinlock itself and the
+ *	shared data being accessed within the spinlocked critical section.  This
+ *	was notationally awkward, easy to forget (and thus error-prone), and
+ *	prevented some useful compiler optimizations.  For these reasons, we
+ *	now require that the macros themselves prevent compiler re-ordering,
+ *	so that the caller doesn't need to take special precautions.
  *
  *	On platforms with weak memory ordering, the TAS(), TAS_SPIN(), and
  *	S_UNLOCK() macros must further include hardware-level memory fence
@@ -484,14 +486,14 @@ tas(volatile slock_t *lock)
 #define S_UNLOCK(lock)	\
 do \
 { \
-	__asm__ __volatile__ ("	lwsync \n"); \
+	__asm__ __volatile__ ("	lwsync \n" ::: "memory"); \
 	*((volatile slock_t *) (lock)) = 0; \
 } while (0)
 #else
 #define S_UNLOCK(lock)	\
 do \
 { \
-	__asm__ __volatile__ ("	sync \n"); \
+	__asm__ __volatile__ ("	sync \n" ::: "memory"); \
 	*((volatile slock_t *) (lock)) = 0; \
 } while (0)
 #endif /* USE_PPC_LWSYNC */
@@ -599,7 +601,9 @@ do \
 		   ".set noreorder  \n" \
 		   ".set nomacro\n" \
 		   "sync\n" \
-		   ".set pop  "); \
+		   ".set pop  " \
+		   : \
+		   : "memory"); \
 	*((volatile slock_t *) (lock)) = 0; \
 } while (0)
 
@@ -657,6 +661,23 @@ tas(volatile slock_t *lock)
 typedef unsigned char slock_t;
 #endif
 
+/*
+ * Note that this implementation is unsafe for any platform that can speculate
+ * a memory access (either load or store) after a following store.  That
+ * happens not to be possible on x86 and most legacy architectures (some are
+ * single-processor!), but many modern systems have weaker memory ordering.
+ * Those that do must define their own version of S_UNLOCK() rather than
+ * relying on this one.
+ */
+#if !defined(S_UNLOCK)
+#if defined(__INTEL_COMPILER)
+#define S_UNLOCK(lock)	\
+	do { __memory_barrier(); *(lock) = 0; } while (0)
+#else
+#define S_UNLOCK(lock)	\
+	do { __asm__ __volatile__("" : : : "memory");  *(lock) = 0; } while (0)
+#endif
+#endif
 
 #endif	/* defined(__GNUC__) || defined(__INTEL_COMPILER) */
 
@@ -730,9 +751,13 @@ tas(volatile slock_t *lock)
 	return (lockval == 0);
 }
 
-#endif /* __GNUC__ */
+#define S_UNLOCK(lock)	\
+	do { \
		__asm__ __volatile__("" : : : "memory"); \
+		*TAS_ACTIVE_WORD(lock) = -1; \
+	} while (0)
 
-#define S_UNLOCK(lock)	(*TAS_ACTIVE_WORD(lock) = -1)
+#endif /* __GNUC__ */
 
 #define 
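For readers unfamiliar with the idiom, here is a minimal standalone sketch (illustrative only; the demo_* names are invented and this is not part of the patch) of what a compiler-barrier-protected unlock looks like for gcc-style compilers. On weakly ordered hardware a real CPU fence, like the lwsync/sync instructions in the hunks above, is also required.

/*
 * The empty asm with a "memory" clobber is a pure compiler barrier: it
 * forbids the compiler from moving loads or stores across it, so writes
 * made inside the critical section cannot be reordered past the
 * releasing store.
 */
typedef unsigned char demo_slock_t;	/* assumed lock type for illustration */

static inline void
demo_s_unlock(volatile demo_slock_t *lock)
{
	__asm__ __volatile__("" : : : "memory");	/* compiler barrier */
	*lock = 0;									/* releasing store */
}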

Re: [HACKERS] posix_fadvise() and pg_receivexlog

2014-09-09 Thread Robert Haas
On Tue, Sep 9, 2014 at 8:07 AM, didier did...@gmail.com wrote:
 Well, I'd like to hear someone from the field complaining that
 pg_receivexlog is thrashing the cache and thus reducing the performance of
 some other process. Or at least a synthetic test case that demonstrates that
 happening.
 It's not with pg_receivexlog but it's related.

 On a small box performance was good enough without a replication
 server connected, but not so with one connected: there was 1GB worth
 of WAL sitting in RAM vs. next to nothing without the slave!
 setup:
 8GB RAM
 2GB shared_buffers (smaller has other issues)
 checkpoint_segments 40 (smaller values trigger too many xlog checkpoints)
 checkpoints spread over 10 min, writing 30 to 50% of shared buffers.
 live data set fit in RAM.
 constant load.

 On startup (1 or 2/hour) applications were running requests on cold
 data which were now saturating IO.
 I'm not sure it's an OS bug, as the WAL was 'hotter' than the cold data.

 A cron task every minute with vmtouch -e for evicting old WAL files
 from memory has solved the issue.

That seems like pretty good evidence that it might be worth doing
something here.  But I still think maybe it should be optional,
because if the user plans to reread those files and, say, copy them
somewhere else, then they won't want this behavior.
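For concreteness, the optional eviction being discussed could look roughly like the sketch below, using only standard POSIX calls (the function name is invented; some systems need feature-test macros for posix_fadvise). A len of 0 means "to the end of the file".

#include <fcntl.h>
#include <unistd.h>

/* Hint the kernel that a finished WAL segment won't be reread soon. */
static void
evict_from_cache(const char *path)
{
	int		fd = open(path, O_RDONLY);

	if (fd >= 0)
	{
		(void) posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
		close(fd);
	}
}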

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Built-in binning functions

2014-09-09 Thread Tom Lane
Petr Jelinek p...@2ndquadrant.com writes:
 On 07/09/14 21:09, Andres Freund wrote:
 On 2014-09-07 15:05:35 -0400, Tom Lane wrote:
 I think the main remaining issue is that we don't have consensus on
 the function name AFAICT.  I'm okay with using width_bucket(), as
 is done in the latest patch, but there were objections ...

 Just reread that part of the thread and personally I disliked all the
 other suggested names more than width_bucket.

 Same here, that's why I didn't change it.

Not hearing any further discussion, I committed it with that name
(and a bit of further cleanup).

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [Fwd: Re: [HACKERS] proposal: new long psql parameter --on-error-stop]

2014-09-09 Thread Andres Freund
Hi,

Given we already have three topics for --help and I can see others I
went with my --help= proposal.

On 2014-08-28 13:20:07 +0200, Andres Freund wrote:
 Some stuff I changed:
 * I rephrased the sgml changes
 * s/Printing options/Display options/. Or maybe Display influencing
   variables? That makes it clearer why they're listed under
   --help-variables.
 * I added \? commands as an alias for a plain \?
   That way the scheme can sensibly be expanded.
 * I renamed help_variables() to be in line with the surrounding functions.

I integrated all those on top of your help-variables-12.patch.

Then I:
* re-added psql -i which you, probably accidentally, removed
* rephrased the sgml stuff
* removed the "Report bugs to ..." text from helpVariables()
* Changed things so that both --help and \? support commands,
  options and variables as help topics
* fixed things so that --help=wrongoption returns a proper exit code
  again

I attached both the full and the incremental diff. I'd appreciate
somebody looking over the docs...

I plan to push this soon.

Greetings,

Andres Freund

--
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
From 9fcf3f0983ccd617c20dc4e87889a894d873b92e Mon Sep 17 00:00:00 2001
From: Andres Freund and...@anarazel.de
Date: Tue, 9 Sep 2014 22:19:14 +0200
Subject: [PATCH] Add new psql help topics, accessible to both --help and \?.

Add --help=<topic> for the commandline, and \? <topic> as a backslash
command, to show more help than the invocations without parameters
do. "commands", "variables" and "options" currently exist as help
topics describing, respectively, backslash commands, psql variables,
and commandline switches. Without parameters the help commands show
their previous topic.

Author: Pavel Stehule, editorialized by many

Reviewed-By: Andres Freund, Petr Jelinek, Fujii Masao, MauMau, Abhijit
Menon-Sen and Erik Rijkers.

Discussion: CAFj8pRDVGuC-nXBfe2CK8vpyzd2Dsr9GVpbrATAnZO=2yq0...@mail.gmail.com,
cafj8pra54abtv2rxdtrxiad8hy8wxmovlqhjdrcwenhdd7o...@mail.gmail.com
---
 doc/src/sgml/ref/psql-ref.sgml |  23 -
 src/bin/psql/command.c |  14 ++-
 src/bin/psql/help.c| 210 +++--
 src/bin/psql/help.h|   4 +-
 src/bin/psql/startup.c |  29 --
 src/bin/psql/tab-complete.c|   7 ++
 6 files changed, 224 insertions(+), 63 deletions(-)

diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index 29ad1aa..aa71674 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -560,11 +560,17 @@ EOF
 
 <varlistentry>
   <term><option>-?</></term>
-  <term><option>--help</></term>
+  <term><option>--help[=<replaceable class="parameter">topic</>]</option></term>
   <listitem>
   <para>
-  Show help about <application>psql</application> command line
-  arguments, and exit.
+  Show help about <application>psql</application> and exit. The optional
+  <replaceable class="parameter">topic</> parameter (defaulting
+  to <literal>options</literal>) selects which part of psql is
+  explained: <literal>commands</> describes <application>psql</>'s
+  backslash commands; <literal>options</> describes the commandline
+  switches that can be passed to <application>psql</>;
+  and <literal>variables</> shows help about psql configuration
+  variables.
   </para>
   </listitem>
 </varlistentry>
@@ -2574,10 +2580,17 @@ testdb=&gt; <userinput>\setenv LESS -imx4F</userinput>
 
 
   <varlistentry>
-    <term><literal>\?</literal></term>
+    <term><literal>\? [ <replaceable class="parameter">topic</> ]</literal></term>
     <listitem>
     <para>
-    Shows help information about the backslash commands.
+    Shows help information. The optional
+    <replaceable class="parameter">topic</> parameter
+    (defaulting to <literal>commands</>) selects which part of psql is
+    explained: <literal>commands</> describes <application>psql</>'s
+    backslash commands; <literal>options</> describes the commandline
+    switches that can be passed to <application>psql</>;
+    and <literal>variables</> shows help about psql configuration
+    variables.
     </para>
     </listitem>
   </varlistentry>
diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c
index 39b5777..5d90ca2 100644
--- a/src/bin/psql/command.c
+++ b/src/bin/psql/command.c
@@ -1491,7 +1491,19 @@ exec_command(const char *cmd,
 
 	/* \? -- slash command help */
	else if (strcmp(cmd, "?") == 0)
-		slashUsage(pset.popt.topt.pager);
+	{
+		char	   *opt0 = psql_scan_slash_option(scan_state,
+	OT_NORMAL, NULL, false);
+
+		if (!opt0 || strcmp(opt0, "commands") == 0)
+			slashUsage(pset.popt.topt.pager);
+		else if (strcmp(opt0, "options") == 0)
+			usage(pset.popt.topt.pager);
+		else if (strcmp(opt0, "variables") == 0)
+			helpVariables(pset.popt.topt.pager);
+		else
+			slashUsage(pset.popt.topt.pager);
+	}
 
 #if 0
 
diff --git a/src/bin/psql/help.c 

Re: [HACKERS] pg_background (and more parallelism infrastructure patches)

2014-09-09 Thread Robert Haas
On Tue, Sep 9, 2014 at 1:18 PM, Petr Jelinek p...@2ndquadrant.com wrote:
 I think that's completely wrong.  As the patch series demonstrates,
 it's not limited to propagating ErrorResponse and NoticeResponse.  It
 can also propagate NotifyResponse and RowDescription and DataRow and
 anything else that comes along.  We are not just propagating errors;
 we are propagating all protocol messages of whatever type.  So tying
 it to elog specifically is not right.

 Oh in that case, I think what Andres proposed is actually quite good. I know
 the hook works fine; it just seems like using a somewhat hackish solution to
 save 20 lines of code.

If it's 20 lines of code, I'm probably fine to go that way.  Let's see
if we can figure out what those 20 lines look like.

libpq.h exports 29 functions that do a variety of different things.
Of those, 20 are in pqcomm.c and the others are in be-secure.c.  I
presume that pluggability for the latter group, if needed at all, is a
separate project.  The remaining ones break down like this:

- StreamServerPort, StreamConnection, StreamClose, and
TouchSocketFiles are intended to be called only from the postmaster,
to set up and tear down the listening socket and individual
connections.  Presumably this is not what we care about here.
- pq_getbytes(), pq_getstring(), pq_getmessage(), pq_getbyte(),
pq_peekbyte(), and pq_getbyte_if_available() handle *reading* data
from the socket.  Since you previously agreed that we didn't need to
build two-way communication on top of this, I would think that would
mean that these don't need to be pluggable either.  But maybe I'm
wrong.
- The remaining functions are pq_init(), pq_comm_reset(), pq_flush(),
pq_flush_if_writable(), pq_is_send_pending(), pq_putmessage(),
pq_putmessage_noblock(), pq_startcopyout(), and pq_endcopyout().
These are the ones that I think are potentially interesting.

I didn't choose to provide hooks for all of these in the submitted
patch because they're not all needed for what I want to do here:
pq_startcopyout() and pq_endcopyout() are only needed for V2 protocol
support, which did not interest me (nor did COPY, really);
pq_putmessage_noblock(), pq_flush_if_writable(), and
pq_is_send_pending() are only used for the walsender protocol, which
doesn't seem useful to redirect to a non-socket; and I just didn't
happen to have any use for pq_init() or pq_comm_reset().  Hence what I
ended up with.

But, I could revisit that.  Suppose I provide a structure with 10
function pointers for all ten of those functions, or maybe 9 since
pq_init() is called so early that it's not likely we'll have control
to put the hooks in place before that point, and anyway whatever code
installs the hooks can do its own initialization then.  Then make a
global variable like pqSendMethods and #define pq_comm_reset() to be
pqSendMethods->comm_reset(), pq_flush() to be pqSendMethods->flush(),
and so on for all 9 or 10 methods.  Then the pqmq code could just
change pqSendMethods to point to its own method structure instead of
the default one.  Would that address this concern?  It's
more than 20 lines of code, but it's not TOO bad.
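As a rough illustration of that scheme (a sketch of the proposal, not committed code; all names here are invented):

#include <stddef.h>
#include <stdbool.h>

typedef struct PQcommMethods
{
	void		(*comm_reset) (void);
	int			(*flush) (void);
	int			(*flush_if_writable) (void);
	bool		(*is_send_pending) (void);
	int			(*putmessage) (char msgtype, const char *s, size_t len);
	void		(*putmessage_noblock) (char msgtype, const char *s, size_t len);
	void		(*startcopyout) (void);
	void		(*endcopyout) (bool errorAbort);
} PQcommMethods;

extern PQcommMethods *PqCommMethods;	/* socket methods by default */

/* callers keep using the familiar names; the table is swapped underneath */
#define pq_comm_reset()	(PqCommMethods->comm_reset())
#define pq_flush()		(PqCommMethods->flush())
#define pq_putmessage(msgtype, s, len) \
	(PqCommMethods->putmessage(msgtype, s, len))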

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-09-09 Thread Robert Haas
On Thu, Sep 4, 2014 at 5:46 PM, Peter Geoghegan p...@heroku.com wrote:
 On Thu, Sep 4, 2014 at 2:18 PM, Robert Haas robertmh...@gmail.com wrote:
 Eh, maybe?  I'm not sure why the case where we're using abbreviated
 keys should be different than the case we're not.  In either case this
 is a straightforward trade-off: if we do a memcmp() before strcoll(),
 we win if it returns 0 and lose if returns non-zero and strcoll also
 returns non-zero.  (If memcmp() returns non-zero but strcoll() returns
 0, it's a tie.)  I'm not immediately sure why it should affect the
 calculus one way or the other whether abbreviated keys are in use; the
 question of how much faster memcmp() is than strcoll() seems like the
 relevant consideration either way.

 Not quite. Consider my earlier example of sorting ~300,000 cities by
 country only. That's a pretty low cardinality attribute. We win big,
 and we are almost certain that the abbreviated key cardinality is a
 perfect proxy for the full key cardinality so we stick with
 abbreviated keys while copying over tuples. Sure, most comparisons
 will actually be resolved with a memcmp() == 0 rather than an
 abbreviated comparison, but under my ad-hoc cost model there is no
 distinction, since they're both very much cheaper than a strcoll()
 (particularly when we factor in the NUL termination copying that a
 memcmp() == 0 also avoids). To a lesser extent we're also justified
 in that optimism because we've already established that roughly the
 first 8 bytes of the string are bitwise equal.

 So the difference is that in the abbreviated key case, we are at least
 somewhat justified in our optimism. Whereas, where we're just eliding
 fmgr overhead, say on the 2nd or subsequent attribute of a multi-key
 sort, it's totally opportunistic to chance a memcmp() == 0. The
 latter optimization can only be justified by the fact that the
 memcmp() is somewhere between dirt cheap and free. That seems like
 something that should significantly impact the calculus.

Boiled down, what you're saying is that you might have a set that
contains lots of duplicates in general, but not very many where the
abbreviated-keys also match.  Sure, that's true.  But you might also
not have that case, so I don't see where that gets us; the same
worst-case test case Heikki developed the last time we relitigated
this point is still relevant here.  In order to know how much we're
giving up in that case, we need the exact number I asked you to
provide in my previous email: the ratio of the cost of strcoll() to
the cost of memcmp().

I see that you haven't chosen to provide that information in any of
your four responses.
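For concreteness, the opportunistic check under discussion amounts to something like the following sketch (illustrative only; assumes NUL-terminated strings):

#include <string.h>

/*
 * Try a cheap bytewise equality test before paying for a locale-aware
 * comparison: equal inputs skip strcoll() entirely, unequal inputs pay
 * one extra memcmp().
 */
static int
cmp_memcmp_first(const char *a, const char *b)
{
	size_t		alen = strlen(a);
	size_t		blen = strlen(b);

	if (alen == blen && memcmp(a, b, alen) == 0)
		return 0;				/* resolved without strcoll() */
	return strcoll(a, b);		/* locale-aware fallback */
}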

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Spinlocks and compiler/memory barriers

2014-09-09 Thread Andres Freund
On 2014-09-09 13:52:40 -0400, Robert Haas wrote:
 I had forgotten that it needed an update.  Thanks for the reminder.  Here's 
 v2.

I've attached an incremental patch fixing minor gripes. Other than that I
think you can go ahead with this once the buildfarm accepts the sparc
fixes (man, those machines are slow. spoonbill apparently takes ~5h for
one run).

I've done a read through s_lock.h and the only remaining potential issue
that I see is that I have no idea if unixware's tas() is actually a safe
compiler barrier (it is a memory barrier). And I really, really can't
make myself care.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
diff --git a/src/backend/storage/lmgr/s_lock.c b/src/backend/storage/lmgr/s_lock.c
index 38dc34d..e8d3502 100644
--- a/src/backend/storage/lmgr/s_lock.c
+++ b/src/backend/storage/lmgr/s_lock.c
@@ -159,6 +159,7 @@ void
 s_unlock(slock_t *lock)
 {
 #ifdef TAS_ACTIVE_WORD
+	/* HP's PA-RISC */
 	*TAS_ACTIVE_WORD(lock) = -1;
 #else
 	*lock = 0;
diff --git a/src/include/storage/s_lock.h b/src/include/storage/s_lock.h
index 8e0c4c3..5f62a2c 100644
--- a/src/include/storage/s_lock.h
+++ b/src/include/storage/s_lock.h
@@ -57,8 +57,8 @@
  *
  *	It is the responsibility of these macros to make sure that the compiler
  *	does not re-order accesses to shared memory to precede the actual lock
- *	acquisition, or follow the lock release.  Prior to PostgreSQL 9.5, this
- *	was the caller's responsibility, which meant that callers had to use
+ *	acquisition, or following the lock release.  Prior to PostgreSQL 9.5,
+ *  this was the caller's responsibility, which meant that callers had to use
  *	volatile-qualified pointers to refer to both the spinlock itself and the
  *	shared data being accessed within the spinlocked critical section.  This
  *	was notationally awkward, easy to forget (and thus error-prone), and
@@ -403,7 +403,7 @@ tas(volatile slock_t *lock)
  * No stbar or membar available, luckily no actually produced hardware
  * requires a barrier.
  */
-#define S_UNLOCK(lock)		(*((volatile slock_t *) (lock)) = 0)
+/* use the default S_UNLOCK definition for gcc */
 #elif  __sparcv8
 /* stbar is available (and required for both PSO, RMO), membar isn't */
 #define S_UNLOCK(lock)	\
@@ -921,14 +921,14 @@ extern int	tas_sema(volatile slock_t *lock);
  * store) after a following store; platforms where this is possible must
  * define their own S_UNLOCK.  But CPU reordering is not the only concern:
  * if we simply defined S_UNLOCK() as an inline macro, the compiler might
- * reorder instructions from the critical section to occur after the lock
- * release.  Since the compiler probably can't know what the external
+ * reorder instructions from inside the critical section to occur after the
+ * lock release.  Since the compiler probably can't know what the external
  * function s_unlock is doing, putting the same logic there should be adequate.
  * A sufficiently-smart globally optimizing compiler could break that
  * assumption, though, and the cost of a function call for every spinlock
  * release may hurt performance significantly, so we use this implementation
  * only for platforms where we don't know of a suitable intrinsic.  For the
- * most part, those are relatively obscure platform/compiler platforms to 
+ * most part, those are relatively obscure platform/compiler platforms to
  * which the PostgreSQL project does not have access.
  */
 #define USE_DEFAULT_S_UNLOCK

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-09-09 Thread Robert Haas
On Fri, Sep 5, 2014 at 10:45 PM, Peter Geoghegan p...@heroku.com wrote:
 While I gave serious consideration to your idea of having a dedicated
 abbreviation comparator, and not duplicating sortsupport state when
 abbreviated keys are used (going so far as to almost fully implement
 the idea), I ultimately decided that my vote says we don't do that. It
 seemed to me that there were negligible benefits for increased
 complexity. In particular, I didn't want to burden tuplesort with
 having to worry about whether or not abbreviation was aborted during
 tuple copying, or was not used by the opclass in the first place -
 implementing your scheme makes that distinction relevant. It's very
 convenient to have comparetup_heap() compare the leading sort key
 (that specifically looks at SortTuple.datum1 pairs) indifferently,
 using the same comparator for abbreviated and not abbreviated
 cases indifferently. comparetup_heap() does not seem like a great
 place to burden with caring about each combination any more than
 strictly necessary.

 I like that I don't have to care about every combination, and can
 treat abbreviation abortion as the special case with the extra step,
 in line with how I think of the optimization conceptually. Does that
 make sense?

No.  comparetup_heap() is hip-deep in this optimization as it stands,
and what I proposed - if done correctly - isn't going to make that
significantly worse.  In fact, it really ought to make things better:
you should be able to set things up so that ssup->comparator is always
the test that should be applied first, regardless of whether we're
aborted or not-aborted or not doing this in the first place; and then
ssup->tiebreak_comparator, if not NULL, can be applied after that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Spinlocks and compiler/memory barriers

2014-09-09 Thread Robert Haas
On Tue, Sep 9, 2014 at 5:09 PM, Andres Freund and...@2ndquadrant.com wrote:
 On 2014-09-09 13:52:40 -0400, Robert Haas wrote:
 I had forgotten that it needed an update.  Thanks for the reminder.  Here's 
 v2.

 I've attached an incremental patch fixing minor gripes. Other than that I
 think you can go ahead with this once the buildfarm accepts the sparc
 fixes (man, those machines are slow. spoonbill apparently takes ~5h for
 one run).

 I've done a read through s_lock.h and the only remaining potential issue
 that I see is that I have no idea if unixware's tas() is actually a safe
 compiler barrier (it is a memory barrier). And I really, really can't
 make myself care.

  *It is the responsibility of these macros to make sure that the compiler
  *does not re-order accesses to shared memory to precede the actual lock
- *acquisition, or follow the lock release.  Prior to PostgreSQL 9.5, this
- *was the caller's responsibility, which meant that callers had to use
+ *acquisition, or following the lock release.  Prior to PostgreSQL 9.5,
+ *  this was the caller's responsibility, which meant that callers had to use

AFAICS my version is right and your version is grammatically
incorrect. re-order to precede or follow uses the same verb tense in
both branches of the or; re-order to precede or following does not.

I agree that if there are problems on UnixWare, we can let anyone who
cares about UnixWare submit a patch to fix them.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Spinlocks and compiler/memory barriers

2014-09-09 Thread Andres Freund
On 2014-09-09 17:30:44 -0400, Robert Haas wrote:
 On Tue, Sep 9, 2014 at 5:09 PM, Andres Freund and...@2ndquadrant.com wrote:
  On 2014-09-09 13:52:40 -0400, Robert Haas wrote:
  I had forgotten that it needed an update.  Thanks for the reminder.  
  Here's v2.
 
  I've attached an incremental patch fixing minor gripes. Other than that I
  think you can go ahead with this once the buildfarm accepts the sparc
  fixes (man, those machines are slow. spoonbill apparently takes ~5h for
  one run).
 
  I've done a read through s_lock.h and the only remaining potential issue
  that I see is that I have no idea if unixware's tas() is actually a safe
  compiler barrier (it is a memory barrier). And I really, really can't
  make myself care.
 
   *It is the responsibility of these macros to make sure that the compiler
   *does not re-order accesses to shared memory to precede the actual lock
 - *acquisition, or follow the lock release.  Prior to PostgreSQL 9.5, this
 - *was the caller's responsibility, which meant that callers had to use
 + *acquisition, or following the lock release.  Prior to PostgreSQL 9.5,
 + *  this was the caller's responsibility, which meant that callers had to use
 
 AFAICS my version is right and your version is grammatically
 incorrect. re-order to precede or follow uses the same verb tense in
 both branches of the or; re-order to precede or following does not.

Wasn't sure about that one. It read oddly to me, but then I'm not a
native speaker. And won't read the sentence often ;)

 I agree that if there are problems on UnixWare, we can let anyone who
 cares about UnixWare submit a patch to fix them.

Ok.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Spinlocks and compiler/memory barriers

2014-09-09 Thread Robert Haas
On Tue, Sep 9, 2014 at 5:32 PM, Andres Freund and...@2ndquadrant.com wrote:
 On 2014-09-09 17:30:44 -0400, Robert Haas wrote:
 On Tue, Sep 9, 2014 at 5:09 PM, Andres Freund and...@2ndquadrant.com wrote:
  On 2014-09-09 13:52:40 -0400, Robert Haas wrote:
  I had forgotten that it needed an update.  Thanks for the reminder.  
  Here's v2.
 
  I've attached an incremental patch fixing minor gripes. Other than that I
  think you can go ahead with this once the buildfarm accepts the sparc
  fixes (man, those machines are slow. spoonbill apparently takes ~5h for
  one run).
 
  I've done a read through s_lock.h and the only remaining potential issue
  that I see is that I have no idea if unixware's tas() is actually a safe
  compiler barrier (it is a memory barrier). And I really, really can't
  make myself care.

   *It is the responsibility of these macros to make sure that the 
 compiler
   *does not re-order accesses to shared memory to precede the actual lock
 - *acquisition, or follow the lock release.  Prior to PostgreSQL 9.5, 
 this
 - *was the caller's responsibility, which meant that callers had to use
 + *acquisition, or following the lock release.  Prior to PostgreSQL 9.5,
 + *  this was the caller's responsibility, which meant that callers had to 
 use

 AFAICS my version is right and your version is grammatically
  incorrect. re-order to precede or follow uses the same verb tense in
  both branches of the or; re-order to precede or following does not.

 Wasn't sure about that one. It read oddly to me, but then I'm not a
 native speaker. And won't read the sentence often ;)

 I agree that if there are problems on UnixWare, we can let anyone who
 cares about UnixWare submit a patch to fix them.

 Ok.

So, that's committed, then. I think we should pick something that uses
spinlocks and is likely to fail spectacularly if we haven't got this
totally right yet, and de-volatilize it.  And then watch to see what
turns red in the buildfarm and/or which users start screaming.  I'm
inclined to propose lwlock.c as a candidate, since that's very widely
used and a place where we know there's significant contention.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Spinlocks and compiler/memory barriers

2014-09-09 Thread Andres Freund
On 2014-09-09 17:54:03 -0400, Robert Haas wrote:
 So, that's committed, then.

Yay, finally.

 I think we should pick something that uses
 spinlocks and is likely to fail spectacularly if we haven't got this
 totally right yet, and de-volatilize it.  And then watch to see what
 turns red in the buildfarm and/or which users start screaming.

Good plan.

 I'm inclined to propose lwlock.c as a candidate, since that's very
 widely used and a place where we know there's significant contention.

As an additional candidate, I suggest GetSnapshotData(). It's surely one of
the hottest functions in postgres, and I've seen some performance
increases from de-volatilizing it. IIRC it requires one volatile cast in
one place to enforce that a variable is accessed only once. Not sure if
we want to add such volatile casts or use something like linux's
ACCESS_ONCE macro for some common types. It helps to grep for places
worthy of inspection.
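For reference, the Linux-kernel-style idiom mentioned looks roughly like this (a sketch for gcc-style compilers; the proc->xmin usage line is a made-up example):

/*
 * The volatile cast forces the compiler to emit exactly one load (or
 * store) for the access, instead of caching the value in a register or
 * re-reading it.
 */
#define ACCESS_ONCE(x)	(*(volatile __typeof__(x) *) &(x))

/* e.g., guarantee a single read of a shared variable: */
/*   TransactionId xmin = ACCESS_ONCE(proc->xmin); */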

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_background (and more parallelism infrastructure patches)

2014-09-09 Thread Petr Jelinek

On 09/09/14 22:48, Robert Haas wrote:

On Tue, Sep 9, 2014 at 1:18 PM, Petr Jelinek p...@2ndquadrant.com wrote:

I think that's completely wrong.  As the patch series demonstrates,
it's not limited to propagating ErrorResponse and NoticeResponse.  It
can also propagate NotifyResponse and RowDescription and DataRow and
anything else that comes along.  We are not just propagating errors;
we are propagating all protocol messages of whatever type.  So tying
it to elog specifically is not right.


Oh in that case, I think what Andres proposed is actually quite good. I know
the hook works fine; it just seems like using a somewhat hackish solution to
save 20 lines of code.


If it's 20 lines of code, I'm probably fine to go that way.  Let's see
if we can figure out what those 20 lines look like.

libpq.h exports 29 functions that do a variety of different things.
Of those, 20 are in pqcomm.c and the others are in be-secure.c.  I
presume that pluggability for the latter group, if needed at all, is a
separate project.  The remaining ones break down like this:



Ugh


- The remaining functions are pq_init(), pq_comm_reset(), pq_flush(),
pq_flush_if_writable(), pq_is_send_pending(), pq_putmessage(),
pq_putmessage_noblock(), pq_startcopyout(), and pq_endcopyout().
These are the ones that I think are potentially interesting.

I didn't choose to provide hooks for all of these in the submitted
patch because they're not all needed for what I want to do here:
pq_startcopyout() and pq_endcopyout() are only needed for V2 protocol
support, which did not interest me (nor did COPY, really);
pq_putmessage_noblock(), pq_flush_if_writable(), and
pq_is_send_pending() are only used for the walsender protocol, which
doesn't seem useful to redirect to a non-socket; and I just didn't
happen to have any use for pq_init() or pq_comm_reset().  Hence what I
ended up with.

But, I could revisit that.  Suppose I provide a structure with 10
function pointers for all ten of those functions, or maybe 9 since
pq_init() is called so early that it's not likely we'll have control
to put the hooks in place before that point, and anyway whatever code
installs the hooks can do its own initialization then.  Then make a
global variable like pqSendMethods and #define pq_comm_reset() to be
pqSendMethods->comm_reset(), pq_flush() to be pqSendMethods->flush(),
and so on for all 9 or 10 methods.  Then the pqmq code could just
change pqSendMethods to point to its own method structure instead of
the default one.  Would that address this concern?  It's
more than 20 lines of code, but it's not TOO bad.



Yes, although my issue with the hooks was not that you only provided 
them for 2 functions, but the fact that it had no structure and the 
implementation was "if hook set do this, else do that", which I don't see 
as a good way of doing it.


All I personally want is structure and extensibility, so a struct with just 
the needed functions is good enough for me, and I would be ok to leave 
the other 8 functions as an option for whomever needs to make them 
overridable in the future.


--
 Petr Jelinek  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [Fwd: Re: [HACKERS] proposal: new long psql parameter --on-error-stop]

2014-09-09 Thread Andres Freund
On 2014-09-09 22:22:45 +0200, Andres Freund wrote:
 I plan to push this soon.

Done.

Thanks for the patch!

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] FD_SETSIZE on Linux?

2014-09-09 Thread Thom Brown
Hi,

I noticed when trying to set pgbench's client count to a high number, I had
to reduce it, and I found the maximum I can get away with is 1014.  Any
higher and I get:

invalid number of clients: 1015

I find this in pgbench.c:

#ifdef FD_SETSIZE
#define MAXCLIENTS  (FD_SETSIZE - 10)
#else
#define MAXCLIENTS  1024
#endif


And FD_SETSIZE is defined before it:

#ifdef WIN32
#define FD_SETSIZE 1024 /* set before winsock2.h is included */
#endif   /* ! WIN32 */

... but apparently only if using Windows, which I'm not.

So it appears that MAXCLIENTS is being set to 1014 (1024 - 10), which looks
like it should only be the case on Windows.

I'm a bit confused here.  Shouldn't my MAXCLIENTS be set to 1024?

Thom


Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-09-09 Thread Tomas Vondra
On 9.9.2014 16:09, Robert Haas wrote:
 On Mon, Sep 8, 2014 at 5:53 PM, Tomas Vondra t...@fuzzy.cz wrote:
 So I only posted the separate patch for those who want to do a review,
 and then a complete patch with both parts combined. But it sure may be
 a bit confusing.
 
 Let's do this: post each new version of the patches only on this
 thread, as a series of patches that can be applied one after another.

OK, attached. Apply in this order

1) dense-alloc-v5.patch
2) hashjoin-nbuckets-v12.patch

 In ExecChooseHashTableSize(), if buckets_bytes is independent of
 nbatch, could we just compute it before working out nbatch, and just
 deduct it from hash_table_bytes? That seems like it'd do the same
 thing for less work.

 Well, we can do that. But I really don't think that's going to make
 measurable difference. And I think the code is more clear this way.
 
 Really?  It seems to me that you're doing more or less the same
 calculation that's already been done a second time.  It took me 20
 minutes of staring to figure out what it was really doing.  Now
 admittedly I was a little low on caffeine, but...

I reworked this part a bit, to make it easier to understand. Mostly
following your recommendations.

 
 Either way, what happens if buckets_bytes is already bigger than
 hash_table_bytes?

 Hm, I don't see how that could happen.
 
 How about an Assert() and a comment, then?

Done. According to the comment, we should never really get over 25%,
except maybe for strange work_mem values, because we're keeping nbuckets
as 2^N. But even then we should not reach 50%, so I added this assert:

   Assert(buckets_bytes <= hash_table_bytes/2);

And then we subtract the buckets_bytes, and continue with the loop.

 
 A few more stylistic issues that I see:

 +   if ((hashtable->nbatch == 1) &&
 +   if (hashtable->nbuckets_optimal <= (INT_MAX/2))
 +   if (size > (HASH_CHUNK_SIZE/8))

 While I'm all in favor of using parentheses generously where it's
 needed to add clarity, these and similar usages seem over the top to
 me.

 It seemed more readable for me at the time of coding it, and it still
 seems better this way, but ...

 https://www.youtube.com/watch?v=CxK_nA2iVXw
 
 Heh.  Well, at any rate, I think PostgreSQL style is to not use
 parentheses in that situation.

FWIW, I added HASH_CHUNK_THRESHOLD macro, specifying the threshold. I
also simplified the conditions a bit, and removed some of the parentheses.

 
 +char * chunk_alloc(HashJoinTable hashtable, int size) {

 Eh... no.

 +   /* XXX maybe we should use MAXALIGN(size) here ... */

 +1.

 I'm not sure what the 'no' is pointing at? Maybe that the parenthesis
 should be on the next line? And is the +1 about doing MAXALIGN?
 
 The "no" is about the fact that you have the return type, the function
 name, and the opening brace on one line instead of three lines as is
 project style.  The "+1" is for applying MAXALIGN to the size.

OK, fixed.

I also did a few 'minor' changes to the dense allocation patch, most
notably:

* renamed HashChunk/HashChunkData to MemoryChunk/MemoryChunkData
  The original naming seemed a bit awkward.

* renamed the function to 'dense_alloc' (instead of 'chunk_alloc')
  Seems like a better fit to what the function does.

* the chunks size is 32kB (instead of 16kB), and we're using 1/4
  threshold for 'oversized' items

  We need the threshold to be >=8kB, to trigger the special case
  within AllocSet. The 1/4 rule is consistent with ALLOC_CHUNK_FRACTION.

I also considered Heikki's suggestion that while rebatching, we can
reuse chunks with a single large tuple, instead of copying it
needlessly. My attempt however made the code much uglier (I almost used
a GOTO, but my hands refused to type it ...). I'll look into that.
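In outline, the dense allocation scheme works like the following sketch (names invented; the real patch keeps oversized tuples in their own tracked chunk rather than a bare malloc, and checks for allocation failure):

#include <stdlib.h>

#define CHUNK_SIZE		(32 * 1024)			/* 32kB chunks, as above */
#define OVERSIZE_LIMIT	(CHUNK_SIZE / 4)	/* 1/4 threshold for big tuples */
#define ALIGN8(s)		(((s) + 7) & ~(size_t) 7)	/* stand-in for MAXALIGN */

typedef struct Chunk
{
	struct Chunk *next;			/* chunk list, walked/freed in bulk */
	size_t		used;			/* bytes handed out so far */
	char		data[CHUNK_SIZE];
} Chunk;

static Chunk *chunk_list = NULL;

static char *
dense_alloc_sketch(size_t size)
{
	Chunk	   *c = chunk_list;
	char	   *ptr;

	size = ALIGN8(size);

	if (size > OVERSIZE_LIMIT)
		return malloc(size);	/* oversized: dedicated allocation */

	/* start a new chunk when the current one can't fit the tuple */
	if (c == NULL || c->used + size > CHUNK_SIZE)
	{
		c = malloc(sizeof(Chunk));
		c->next = chunk_list;
		c->used = 0;
		chunk_list = c;
	}
	ptr = c->data + c->used;
	c->used += size;
	return ptr;
}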


Tomas

diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 589b2f1..d5acfb1 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -47,6 +47,7 @@ static void ExecHashSkewTableInsert(HashJoinTable hashtable,
 		int bucketNumber);
 static void ExecHashRemoveNextSkewBucket(HashJoinTable hashtable);
 
+static char * dense_alloc(HashJoinTable hashtable, int tupleSize);
 
 /* 
  *		ExecHash
@@ -294,6 +295,8 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
	hashtable->spaceAllowedSkew =
		hashtable->spaceAllowed * SKEW_WORK_MEM_PERCENT / 100;
 
+	hashtable->chunks = NULL;
+
 	/*
 	 * Get info about the hash functions to be used for each hash key. Also
 	 * remember whether the join operators are strict.
@@ -556,10 +559,10 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
	int			oldnbatch = hashtable->nbatch;
	int			curbatch = hashtable->curbatch;
 	int			nbatch;
-	int			i;
 	MemoryContext oldcxt;
 	long		ninmemory;
 	long		nfreed;
+	MemoryChunk	oldchunks = hashtable->chunks;
 
 	/* do nothing if we've decided to shut off growth */
 	if 

Re: [HACKERS] FD_SETSIZE on Linux?

2014-09-09 Thread Tom Lane
Thom Brown t...@linux.com writes:
 I find this in pgbench.c:

 #ifdef FD_SETSIZE
 #define MAXCLIENTS  (FD_SETSIZE - 10)
 #else
 #define MAXCLIENTS  1024
 #endif

FD_SETSIZE is supposed to be defined, according to the POSIX spec:

The sys/select.h header shall define the following symbolic constant,
which shall have a value suitable for use in #if preprocessing directives:

FD_SETSIZE
Maximum number of file descriptors in an fd_set structure.

It looks like Linux sets it to 1024.  On RHEL6, at least, I find this:

$ grep -r FD_SETSIZE /usr/include
/usr/include/linux/posix_types.h:#undef __FD_SETSIZE
/usr/include/linux/posix_types.h:#define __FD_SETSIZE   1024
...
/usr/include/sys/select.h:#define   FD_SETSIZE  __FD_SETSIZE
...

 #ifdef WIN32
 #define FD_SETSIZE 1024 /* set before winsock2.h is included */
 #endif   /* ! WIN32 */

Windows probably hasn't got sys/select.h at all, so it may not provide
this symbol.

Interestingly, it looks like POSIX also requires sys/time.h to define
FD_SETSIZE.  I wonder whether Windows has that header?  It'd definitely
be better to get this symbol from the system than assume 1024 will work.
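A quick way to check the platform value (an illustrative standalone program):

#include <stdio.h>
#include <sys/select.h>

int
main(void)
{
	/* prints 1024 on typical Linux, hence MAXCLIENTS = 1024 - 10 = 1014 */
	printf("FD_SETSIZE = %d\n", (int) FD_SETSIZE);
	return 0;
}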

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Escaping from blocked send() reprised.

2014-09-09 Thread Kyotaro HORIGUCHI
Hmm. Sorry, I misunderstood the specification.

 Your approach of coloring tokens seems right, but you have
 broken the parse logic with your added code.
 
 Other than the mistakes others pointed, I found that
 
 - the token style of non-SQL-ident-like tokens, quoted or not, is
   ignored when comparing them, so the following line works.
 
 | local All aLL trust
 
 I suppose this is not what you intended. This is because you have
 ignored the attribute of a token when comparing it as a
 non-SQL-ident token.
 
 
 - '+' at the head of the sequence '+"' is treated as the first
   character of the *quoted* string, e.g. +"hoge" is tokenized as
   +hoge:special_quoted.

I found this is what was intended. This should be documented in
comments.

|2) only users and user-groups require special handling, behaving as follows:
|Normal user :
|  A. unquoted ( USER ) will be treated as user (downcased).
|  B. quoted ( "USeR" ) will be treated as USeR (case-sensitive).
|  C. quoted ( "+USER" ) will be treated as the normal user +USER (i.e. it will
not be considered a user-group) and is case-sensitive, as the string is quoted.

This seems confusing alongside B below, and should be
rearranged.

|   User Group :
|  A. unquoted ( +USERGROUP ) will be treated as +usergroup (downcased).
|  B. plus quoted ( +"UserGROUP" ) will be treated as +UserGROUP
(case-sensitive).
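A minimal sketch of those case rules (illustrative only; the tokenizer is assumed to report whether the name was quoted, and whether a '+' was seen outside the quotes):

#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Downcase unquoted names; keep quoted names case-sensitive. */
static void
normalize_name(const char *name, bool quoted, char *out, size_t outlen)
{
	size_t		i;

	for (i = 0; name[i] != '\0' && i + 1 < outlen; i++)
		out[i] = quoted ? name[i] : (char) tolower((unsigned char) name[i]);
	out[i] = '\0';
}

/*
 * Expected behavior per the rules above:
 *   USER          -> user        (not a group)
 *   "USeR"        -> USeR        (not a group)
 *   "+USER"       -> +USER       (not a group; '+' is inside the quotes)
 *   +USERGROUP    -> usergroup   (group; '+' outside, rest downcased)
 *   +"UserGROUP"  -> UserGROUP   (group; '+' outside, rest case-sensitive)
 */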



 This is why you simply continued processing for '+' without
 discarding and skipping the '+', and without setting in_quote, so the
 following parser code does not work as intended. You should
 understand what the original code does, and insert or modify logic
 without breaking its assumptions.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Scaling shared buffer eviction

2014-09-09 Thread Mark Kirkwood

On 05/09/14 23:50, Amit Kapila wrote:

On Fri, Sep 5, 2014 at 8:42 AM, Mark Kirkwood
mark.kirkw...@catalyst.net.nz
wrote:
 
  On 04/09/14 14:42, Amit Kapila wrote:
 
  On Thu, Sep 4, 2014 at 8:00 AM, Mark Kirkwood
mark.kirkw...@catalyst.net.nz
  wrote:
 
 
 
  Hi Amit,
 
  Results look pretty good. Does it help in the read-write case too?
 
 
  Last time I ran the tpc-b test of pgbench (results of which are
  posted earlier in this thread), there doesn't seem to be any major
  gain for that, however for cases where read is predominant, you
  might see better gains.
 
  I am again planing to take that data in next few days.
 
 
  FWIW below are some test results on the 60 core beast with this patch
applied to 9.4. I'll need to do more runs to iron out the variation,
  but it looks like the patch helps the standard (write heavy) pgbench
workload a little, and clearly helps the read only case.
 

Thanks for doing the test.  I think, if possible, you should take
the performance data with a higher scale factor (4000), as it
seems your machine has 1TB of RAM.  You might want to use the
latest patch I have posted today.



Here's some fairly typical data from read-write and read-only runs at 
scale 4000 for 9.4 beta2 with and without the v7 patch (below). I'm not 
seeing much variation between repeated read-write runs with the same 
config (which is nice - sleep 30 and explicit checkpoint call between 
each one seem to help there).


Interestingly, I note anecdotally that (unpatched) 9.4 beta2 seems to be 
better at higher client counts than beta1 was...


In terms of the effect of the patch - looks pretty similar to the scale 
2000 results for read-write, but read-only is a different and more 
interesting story - unpatched 9.4 is noticeably impacted in the higher 
client counts, whereas the patched version scales as well (or even 
better perhaps) than in the scale 2000 case.


read write (600s)

Clients  | tps| tps (unpatched)
-++
  6  |   9395 |   9334
  12 |  16605 |  16525
  24 |  24634 |  24910
  48 |  32170 |  31275
  96 |  35675 |  36533
 192 |  35579 |  31137
 384 |  30528 |  28308


read only (300s)

Clients  | tps| tps (unpatched)
-++
  6  |  35743 |   35362
  12 | 111019 |  106579
  24 | 199746 |  160305
  48 | 327026 |  198407
  96 | 379184 |  171863
 192 | 356623 |  152224
 384 | 340878 |  128308


regards

Mark





--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_upgrade and epoch

2014-09-09 Thread Greg Stark
On Tue, Sep 9, 2014 at 4:05 PM, Bruce Momjian br...@momjian.us wrote:
  Yes, I did think about that, but it seems like a behavior change.
  However, it is tempting to avoid future bug reports about this.

 When this came up in March, Tom and I agreed that this wasn't something
 we wanted to slip into 9.4.  Given that, it is hard to argue we should
 now slip this into 9.5, 9.4, and 9.3, so unless someone else votes for
 inclusion, I think I will leave this as 9.5-only.

 With no one replying, I will consider this issue closed and not
 backpatch this.

I think the reason nobody's responding is because nobody has anything
significant to add. It's a behaviour change from not-working to
working. Why wouldn't it be backpatched?


-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] LIMIT for UPDATE and DELETE

2014-09-09 Thread Etsuro Fujita

(2014/09/09 18:57), Heikki Linnakangas wrote:

On 09/03/2014 06:39 PM, Robert Haas wrote:

Now some people might argue that we have that anyway, but the fact is
that a lot of people are using inheritance today, even with all its
flaws, and they wouldn't be if there were a long laundry list of
limitations that didn't apply to regular tables.  We should be looking
to lift the limitations that currently exist, not add more.



I agree. If we are to support UPDATE .. ORDER BY .. LIMIT, it should
work with inheritance. So the path forward is (using Marko's phrasing
upthread):

1) We put the LIMIT inside ModifyTable like this patch does.  This
doesn't prevent us from doing ORDER BY in the future, but helps numerous
people who today have to work around the lack of it.
2) Someone rewrites how UPDATE works based on Tom's suggestion here:
http://www.postgresql.org/message-id/1598.1399826...@sss.pgh.pa.us,
which allows us to support ORDER BY on all tables (or perhaps maybe not
FDWs, I don't know how those work).  The LIMIT functionality in this
patch is unaffected.

What's not clear to me is whether it make sense to do 1) without 2) ? Is
UPDATE .. LIMIT without support for an ORDER BY useful enough? And if we
apply this patch now, how much of it needs to be rewritten after 2) ? If
the answers are "yes" and "not much", then we should review this patch
now, and put 2) on the TODO list. Otherwise 2) should be done first.


My answers are "yes" but "completely rewritten".  So, I think we should 
work on 2) first ideally, but 2) seems a large project, at least to me. 
So, I think it would be reasonable to implement UPDATE/DELETE .. LIMIT 
based on Rukh' patch for now and put 2) and the re-implementation of 1) 
after 2) on the TODO list.


I don't have enough time to review it for a while, so I'd like to work 
on it (and postgres_fdw's update pushdown enhancement related to it) 
at the next CF (I think I can do it earlier).  I must apologize to Rukh 
for not having enough time for the patch review.


Thanks,

Best regards,
Etsuro Fujita


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: PENDING_LIST_CLEANUP_SIZE - maximum size of GIN pending list Re: [HACKERS] HEAD seems to generate larger WAL regarding GIN index

2014-09-09 Thread Etsuro Fujita

(2014/09/09 22:17), Fujii Masao wrote:

On Tue, Sep 9, 2014 at 1:28 AM, Jeff Janes jeff.ja...@gmail.com wrote:

I get some compiler warnings on v2 of this patch:

reloptions.c:219: warning: excess elements in struct initializer
reloptions.c:219: warning: (near initialization for 'intRelOpts[15]')



Attached is the updated version of the patch.


Thank you for updating the patch!

I took a quick review of the patch.  It looks good to me, but one thing 
I'm concerned about is this:


You wrote:
 The attached patch introduces the GIN index storage parameter
 PENDING_LIST_CLEANUP_SIZE which specifies the maximum size of
 GIN pending list. If it's not set, work_mem is used as that 
maximum size,

 instead. So this patch doesn't break existing applications that
 currently use work_mem as the threshold for the cleanup operation of
 the pending list. They simply need not set PENDING_LIST_CLEANUP_SIZE.

As you mentioned, I think it's important to consider the existing 
applications, but I'm wondering if it would be a bit confusing for users to 
have two parameters, PENDING_LIST_CLEANUP_SIZE and work_mem, for this 
setting.  Wouldn't it be easier to have only one parameter, 
PENDING_LIST_CLEANUP_SIZE?  How about setting PENDING_LIST_CLEANUP_SIZE 
to work_mem as the default value when running the CREATE INDEX command?


Sorry for the delay.

Best regards,
Etsuro Fujita


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: PENDING_LIST_CLEANUP_SIZE - maximum size of GIN pending list Re: [HACKERS] HEAD seems to generate larger WAL regarding GIN index

2014-09-09 Thread Fujii Masao
On Wed, Sep 10, 2014 at 12:15 PM, Etsuro Fujita
fujita.ets...@lab.ntt.co.jp wrote:
 (2014/09/09 22:17), Fujii Masao wrote:

 On Tue, Sep 9, 2014 at 1:28 AM, Jeff Janes jeff.ja...@gmail.com wrote:

 I get some compiler warnings on v2 of this patch:

 reloptions.c:219: warning: excess elements in struct initializer
 reloptions.c:219: warning: (near initialization for 'intRelOpts[15]')


 Attached is the updated version of the patch.


 Thank you for updating the patch!

 I took a quick review on the patch.  It looks good to me,

Thanks for reviewing the patch!

 but one thing I'm
 concerned about is

 You wrote:
 The attached patch introduces the GIN index storage parameter
 PENDING_LIST_CLEANUP_SIZE which specifies the maximum size of
 GIN pending list. If it's not set, work_mem is used as that maximum
 size,
  instead. So this patch doesn't break existing applications that
  currently use work_mem as the threshold for the cleanup operation of
  the pending list. They simply need not set PENDING_LIST_CLEANUP_SIZE.

 As you mentioned, I think it's important to consider the existing
 applications, but I'm wondering if it would be a bit confusing for users to have
 two parameters,

Yep.

 PENDING_LIST_CLEANUP_SIZE and work_mem, for this setting.
 Wouldn't it be easier to have only one parameter,
 PENDING_LIST_CLEANUP_SIZE?  How about setting PENDING_LIST_CLEANUP_SIZE to
 work_mem as the default value when running the CREATE INDEX command?

That's an idea. But there might be some users who want to change
the cleanup size per session like they can do by setting work_mem,
and your idea would prevent them from doing that...

So what about introducing pending_list_cleanup_size also as a GUC?
That is, users can set the threshold by using either the reloption or the
GUC.
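In other words, the precedence would be roughly this (a sketch; the use of -1 for "reloption not set" is an assumption for illustration):

/* Reloption wins when set; otherwise fall back to the GUC/work_mem value. */
static int
pending_list_cleanup_threshold(int reloption_kb, int guc_kb)
{
	return (reloption_kb >= 0) ? reloption_kb : guc_kb;	/* -1 = not set */
}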

 Sorry for the delay.

No problem. Thanks!

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Scaling shared buffer eviction

2014-09-09 Thread Michael Paquier
On Tue, Sep 9, 2014 at 3:46 AM, Robert Haas robertmh...@gmail.com wrote:
 On Fri, Sep 5, 2014 at 9:19 AM, Amit Kapila amit.kapil...@gmail.com wrote:
 One regression failed on linux due to spacing issue which is
 fixed in attached patch.
I just read the latest patch by curiosity, wouldn't it make more sense
to split the patch into two different patches for clarity: one for the
reclaimer worker centralized around BgMoveBuffersToFreelist and a
second for the pg_stat portion? Those seem like two different features.
Regards,
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_receivexlog and replication slots

2014-09-09 Thread Michael Paquier
On Mon, Sep 8, 2014 at 8:50 PM, Michael Paquier
michael.paqu...@gmail.com wrote:
 On Wed, Sep 3, 2014 at 11:40 PM, Robert Haas robertmh...@gmail.com wrote:
 On Sun, Aug 31, 2014 at 9:45 AM, Magnus Hagander mag...@hagander.net wrote:
 Do we really want those Asserts? There is not a single Assert in
 bin/pg_basebackup today - as is the case for most things in bin/. We
 typically use regular if statements for things that can happen, and
 just ignore the others I think - since the callers are fairly simple
 to trace.

 I have no opinion on whether we want these particular Assert() calls,
 but I note that using Assert() in front-end code only became possible
 in February of 2013, as a result of commit
 e1d25de35a2b1f809e8f8d7b182ce0af004f3ec9.  So the lack of assertions
 there may not be so much because people thought it was a bad idea as
 that it didn't use to work.  Generally, I favor the use of Assert() in
 front-end code in the same scenarios in which we use it in back-end
 code: for checks that shouldn't burden production builds but are
 useful during development.

 Well that was exactly why they have been added first. The assertions
 have been placed in some functions to check for incompatible
 combinations of argument values when those functions are called. I
 don't mind re-adding them if people agree that they make sense. IMO
 they do, and they help the development of future utilities using the
 replication protocol in src/bin/pg_basebackup, as much as the
 refactoring work done on this thread.
Let's keep them then.

 We probably need to adjust the error message as well
 in that case, because it's no longer what's expected, it's what's
 required?
 OK, changed this way.
 The reason for the formulation of the current error message is that it's
 the same across all callsites emitting it to make it easier for
 translators. It used to be more specific at some point and then was
 changed. Since these aren't expected to be hit much I don't really see a
 need to be very detailed.
Re-changed back to that.

 Hm. I'd vote to simplify the code a bit based on the argument that the
  current API only looks at the first 3 columns and does not care about
  the 4th, which is the plugin name.
 Why not return all four columns from RunIdentifySystem(), setting plugin
 to NULL if not available. Then the caller can error out.
This seems the cleaner approach, so I did so.

New patches taking into account all those comments are attached.
Thanks for the feedback.
Regards,
-- 
Michael
From 6a864d1ef18a01e6680fceab181b78d266063200 Mon Sep 17 00:00:00 2001
From: Michael Paquier mich...@otacoo.com
Date: Mon, 1 Sep 2014 20:48:43 +0900
Subject: [PATCH 1/2] Refactoring of pg_basebackup utilities

Code duplication is reduced with the introduction of new APIs for each
individual replication command:
- IDENTIFY_SYSTEM
- CREATE_REPLICATION_SLOT
- DROP_REPLICATION_SLOT
A couple of variables used to identify a timeline ID are changed as well
to be more consistent with core code.
---
 src/bin/pg_basebackup/pg_basebackup.c  |  21 +
 src/bin/pg_basebackup/pg_receivexlog.c |  49 +++
 src/bin/pg_basebackup/pg_recvlogical.c | 132 +++-
 src/bin/pg_basebackup/streamutil.c | 153 +
 src/bin/pg_basebackup/streamutil.h |  10 +++
 5 files changed, 208 insertions(+), 157 deletions(-)

diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index 8b9acea..0ebda9a 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -1569,8 +1569,8 @@ BaseBackup(void)
 {
 	PGresult   *res;
 	char	   *sysidentifier;
-	uint32		latesttli;
-	uint32		starttli;
+	TimeLineID	latesttli;
+	TimeLineID	starttli;
 	char	   *basebkp;
 	char		escaped_label[MAXPGPATH];
 	char	   *maxrate_clause = NULL;
@@ -1624,23 +1624,8 @@ BaseBackup(void)
 	/*
 	 * Run IDENTIFY_SYSTEM so we can get the timeline
 	 */
-	res = PQexec(conn, IDENTIFY_SYSTEM);
-	if (PQresultStatus(res) != PGRES_TUPLES_OK)
-	{
-		fprintf(stderr, _("%s: could not send replication command \"%s\": %s"),
-				progname, "IDENTIFY_SYSTEM", PQerrorMessage(conn));
+	if (!RunIdentifySystem(conn, sysidentifier, latesttli, NULL, NULL))
 		disconnect_and_exit(1);
-	}
-	if (PQntuples(res) != 1 || PQnfields(res)  3)
-	{
-		fprintf(stderr,
-				_("%s: could not identify system: got %d rows and %d fields, expected %d rows and %d or more fields\n"),
-				progname, PQntuples(res), PQnfields(res), 1, 3);
-		disconnect_and_exit(1);
-	}
-	sysidentifier = pg_strdup(PQgetvalue(res, 0, 0));
-	latesttli = atoi(PQgetvalue(res, 0, 1));
-	PQclear(res);
 
 	/*
 	 * Start the actual backup
diff --git a/src/bin/pg_basebackup/pg_receivexlog.c b/src/bin/pg_basebackup/pg_receivexlog.c
index a8b9ad3..cd7a41b 100644
--- a/src/bin/pg_basebackup/pg_receivexlog.c
+++ b/src/bin/pg_basebackup/pg_receivexlog.c
@@ -253,21 +253,10 @@ FindStreamingStart(uint32 *tli)
 static void
 StreamLog(void)
 {
-	PGresult   *res;
 	

Re: [HACKERS] pg_background (and more parallelism infrastructure patches)

2014-09-09 Thread Amit Kapila
On Wed, Sep 10, 2014 at 2:18 AM, Robert Haas robertmh...@gmail.com wrote:

 On Tue, Sep 9, 2014 at 1:18 PM, Petr Jelinek p...@2ndquadrant.com wrote:
  I think that's completely wrong.  As the patch series demonstrates,
  it's not limited to propagating ErrorResponse and NoticeResponse.  It
  can also propagate NotifyResponse and RowDescription and DataRow and
  anything else that comes along.  We are not just propagating errors;
  we are propagating all protocol messages of whatever type.  So tying
  it to elog specifically is not right.
 
  Oh in that case, I think what Andres proposed is actually quite good. I
know
  the hook works fine it just seems like using somewhat hackish solution
to
  save 20 lines of code.

 If it's 20 lines of code, I'm probably fine to go that way.  Let's see
 if we can figure out what those 20 lines look like.

 libpq.h exports 29 functions that do a variety of different things.
 Of those, 20 are in pqcomm.c and the others are in be-secure.c.  I
 presume that pluggability for the latter group, if needed at all, is a
 separate project.  The remaining ones break down like this:

 - StreamServerPort, StreamConnection, StreamClose, and
 TouchSocketFiles are intended to be called only from the postmaster,
 to set up and tear down the listening socket and individual
 connections.  Presumably this is not what we care about here.
 - pq_getbytes(), pq_getstring(), pq_getmessage(), pq_getbyte(),
 pq_peekbyte(), and pq_getbyte_if_available() handle *reading* data
 from the socket.  Since you previously agreed that we didn't need to
 build two-way communication on top of this, I would thank that would
 mean that these don't need to be pluggable either.  But maybe I'm
 wrong.
 - The remaining functions are pq_init(), pq_comm_reset(), pq_flush(),
 pq_flush_if_writable(), pq_is_send_pending(), pq_putmessage(),
 pq_putmessage_noblock(), pq_startcopyout(), and pq_endcopyout().
 These are the ones that I think are potentially interesting.

 I didn't choose to provide hooks for all of these in the submitted
 patch because they're not all needed for what I want to do here:
 pq_startcopyout() and pq_endcopyout() are only needed for V2 protocol
 support, which did not interest me (nor did COPY, really);
 pq_putmessage_noblock(), pq_flush_if_writable(), and
 pq_is_send_pending() are only used for the walsender protocol, which
 doesn't seem useful to redirect to a non-socket; and I just didn't
 happen to have any use for pq_init() or pq_comm_reset().  Hence what I
 ended up with.

 But, I could revisit that.  Suppose I provide a structure with 10
 function pointers for all ten of those functions, or maybe 9 since
 pq_init() is called so early that it's not likely we'll have control
 to put the hooks in place before that point, and anyway whatever code
 installs the hooks can do its own initialization then.

Can we use pq_init() to install function pointers?
Let us say that it will take COMM_METHOD (tcp, shm_mq) as
input and then install function pointers based on communication
method.  We can call this from main function of bgworker (in case
of patch from pg_background_worker_main()) with COMM_METHOD
as shm_mq and BackendInitialize() will pass it as tcp.
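A sketch of that suggestion, building on the method-table idea from earlier in the thread (all names here are invented for illustration):

typedef struct PQcommMethods PQcommMethods;	/* as sketched upthread */

extern PQcommMethods PqCommSocketMethods;	/* default socket implementation */
extern PQcommMethods PqCommMqMethods;		/* shm_mq-backed implementation */
extern PQcommMethods *PqCommMethods;

typedef enum
{
	COMM_METHOD_TCP,
	COMM_METHOD_SHM_MQ
} CommMethod;

/* pq_init() variant that selects the method table up front */
static void
pq_init_sketch(CommMethod method)
{
	PqCommMethods = (method == COMM_METHOD_SHM_MQ)
		? &PqCommMqMethods
		: &PqCommSocketMethods;
}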


  Then make a
 global variable like pqSendMethods and #define pq_comm_reset() to be
 pqSendMethods->comm_reset(), pq_flush() to be pqSendMethods->flush(),
 and so on for all 9 or 10 methods.  Then the pqmq code could just
 change pqSendMethods to point to its own method structure instead of
 the default one.  Would that address the concern this concern?  It's
 more than 20 lines of code, but it's not TOO bad.

This idea seems better than directly using hooks; however, I
don't see any harm in defining pqReceiveMethods for the get APIs as
well, because it would make the whole layer extensible.  Having said
that, since there is currently no use for it, we can leave that
for another patch.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com