Re: [HACKERS] [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-16 Thread Alvaro Herrera
Alvaro Herrera wrote:

 I see another hole in this area.  See do_start_worker() -- there we only
 consider the offsets limit to determine a database to be in
 almost-wrapped-around state (causing emergency attention).  If the
 database in members trouble has no pgstat entry, it might get completely
 ignored.

For the record -- it was pointed out to me that this was actually fixed
by 53bb309d2.  \o/

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [HACKERS] [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-15 Thread Alvaro Herrera
Andres Freund wrote:

 A first version to address this problem can be found appended to this
 email.
 
 Basically it does:
 * Whenever more than MULTIXACT_MEMBER_SAFE_THRESHOLD are used, signal
   autovacuum once per members segment
 * For both members and offsets, once hitting the hard limits, signal
   autovacuum every time. Otherwise we lose the information when
   restarting the database, or when autovac is killed. I ran into this a
   bunch of times while testing.

Sounds reasonable.

I see another hole in this area.  See do_start_worker() -- there we only
consider the offsets limit to determine a database to be in
almost-wrapped-around state (causing emergency attention).  If the
database in members trouble has no pgstat entry, it might get completely
ignored.  I think the way to close this hole is to
find_multixact_start() in the autovac launcher for the database with the
oldest datminmxid, to determine whether we need to activate emergency
mode for it.  (Maybe instead of having this logic in autovacuum, it
should be a new function that receives database datminmulti and returns
a boolean indicating whether the database is in trouble or not.)
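
A minimal sketch of the kind of helper suggested above, assuming the caller has already looked up the member offset at which the database's oldest multixact starts (the job find_multixact_start() would do). The function name, its parameters, and the threshold passed in are hypothetical and not part of PostgreSQL; the real check would use the server's own limits and shared state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the server's 32-bit member offset type. */
typedef uint32_t MultiXactOffset;

/*
 * Hypothetical helper: report whether a database looks to be in "members
 * trouble".  oldest_offset is the member offset where the database's oldest
 * multixact (its datminmxid) starts; next_offset is the next member offset
 * to be assigned.  safe_threshold is an assumed limit supplied by the caller.
 */
bool
sketch_db_in_members_trouble(MultiXactOffset oldest_offset,
                             MultiXactOffset next_offset,
                             uint32_t safe_threshold)
{
    /* A zero threshold would flag every database that has any members. */
    assert(safe_threshold > 0);

    /*
     * Unsigned subtraction yields the distance modulo 2^32, so the result
     * stays meaningful even after next_offset has wrapped past zero.
     */
    MultiXactOffset used = next_offset - oldest_offset;

    return used > safe_threshold;
}
```

Returning a plain boolean keeps the launcher-side decision simple: it only has to ask the question for the database with the oldest datminmxid.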

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-15 Thread Robert Haas
On Fri, Jun 12, 2015 at 7:27 PM, Steve Kehlet steve.keh...@gmail.com wrote:
 Just wanted to report that I rolled back my VM to where it was with 9.4.2
 installed and it wouldn't start. I installed 9.4.4 and now it starts up just
 fine:

 2015-06-12 16:05:58 PDT [6453]: [1-1] LOG:  database system was shut down
 at 2015-05-27 13:12:55 PDT
 2015-06-12 16:05:58 PDT [6453]: [2-1] LOG:  MultiXact member wraparound
 protections are disabled because oldest checkpointed MultiXact 1 does not
 exist on disk
 2015-06-12 16:05:58 PDT [6457]: [1-1] LOG:  autovacuum launcher started
 2015-06-12 16:05:58 PDT [6452]: [1-1] LOG:  database system is ready to
 accept connections
  done
 server started

 And this is showing up in my serverlog periodically as the emergency
 autovacuums are running:

 2015-06-12 16:13:44 PDT [6454]: [1-1] LOG:  MultiXact member wraparound
 protections are disabled because oldest checkpointed MultiXact 1 does not
 exist on disk

 **Thank you Robert and all involved for the resolution to this.**

 With the fixes introduced in this release, such a situation will result in
 immediate emergency autovacuuming until a correct oldestMultiXid value can
 be determined

 Okay, I notice these vacuums are of the "to prevent wraparound" type (like
 VACUUM FREEZE), that do hold locks preventing ALTER TABLEs and such. Good to
 know, we'll plan our software updates accordingly.

 Is there any risk until these autovacuums finish?

As long as you see only a modest number of files in
pg_multixact/members, you're OK.  But in theory, until that emergency
autovacuuming finishes, there's nothing keeping that directory from
wrapping around.
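
One way to keep an eye on that directory is simply to count its entries. The sketch below assumes a POSIX system and that the caller passes the path to pg_multixact/members inside the data directory; the function name is hypothetical. The thread does not say what number counts as "modest", so the sketch only returns the count and leaves the judgment to the operator.

```c
#include <assert.h>
#include <dirent.h>
#include <stddef.h>

/*
 * Hypothetical helper: count the entries in a directory, skipping names
 * that start with '.' (which covers "." and "..").  Returns -1 if the
 * directory cannot be opened.
 */
long
sketch_count_dir_entries(const char *dir_path)
{
    assert(dir_path != NULL);

    DIR *dir = opendir(dir_path);
    if (dir == NULL)
        return -1;

    long count = 0;
    struct dirent *entry;

    while ((entry = readdir(dir)) != NULL)
    {
        if (entry->d_name[0] == '.')
            continue;
        count++;
    }

    closedir(dir);
    return count;
}
```

Sampling the count periodically while the emergency autovacuums run should show whether the number of member segment files is holding steady or still growing.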

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-12 Thread Steve Kehlet
Just wanted to report that I rolled back my VM to where it was with 9.4.2
installed and it wouldn't start. I installed 9.4.4 and now it starts up
just fine:

 2015-06-12 16:05:58 PDT [6453]: [1-1] LOG:  database system was shut down
at 2015-05-27 13:12:55 PDT
 2015-06-12 16:05:58 PDT [6453]: [2-1] LOG:  MultiXact member wraparound
protections are disabled because oldest checkpointed MultiXact 1 does not
exist on disk
 2015-06-12 16:05:58 PDT [6457]: [1-1] LOG:  autovacuum launcher started
 2015-06-12 16:05:58 PDT [6452]: [1-1] LOG:  database system is ready to
accept connections
  done
 server started

And this is showing up in my serverlog periodically as the emergency
autovacuums are running:

 2015-06-12 16:13:44 PDT [6454]: [1-1] LOG:  MultiXact member wraparound
protections are disabled because oldest checkpointed MultiXact 1 does not
exist on disk

**Thank you Robert and all involved for the resolution to this.**

 With the fixes introduced in this release, such a situation will result
in immediate emergency autovacuuming until a correct oldestMultiXid value
can be determined

Okay, I notice these vacuums are of the "to prevent wraparound" type (like
VACUUM FREEZE), that do hold locks preventing ALTER TABLEs and such. Good
to know, we'll plan our software updates accordingly.

Is there any risk until these autovacuums finish?


Re: [HACKERS] [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-11 Thread Jeff Janes
On Wed, Jun 10, 2015 at 7:16 PM, Noah Misch n...@leadboat.com wrote:

 On Mon, Jun 08, 2015 at 03:15:04PM +0200, Andres Freund wrote:
  One more thing:
  Our testing infrastructure sucks. Without writing C code it's basically
  impossible to test wraparounds and such. Even if not particularly useful
  for non-devs, I really think we should have functions for creating
  burning xids/multixacts in core. Or at least in some extension.

 +1.  This keeps coming up, so it's worth maintaining a verified and speedy
 implementation.


+1 from me as well.

Also, I've pretty much given up on testing this area myself, because of the
issue pointed out here:

http://www.postgresql.org/message-id/CAMkU=1wbi5afhytawdkawease_mc00i4y_7ojhp1y-8sgci...@mail.gmail.com

I think this is the same issue as part of Andres' point 1.

It is pretty frustrating and futile to test wrap around when the database
doesn't live long enough to wrap around under the high-stress conditions.

I had thought that all changes to ShmemVariableCache except nextXid should
be WAL logged at the time they occur, not just at the next checkpoint.  But
that wouldn't fix the problem, as the change to ShmemVariableCache has to
be transactional with the change to pg_database.  So it would have to be
WAL logged inside the commit record or any transaction which changes
pg_database.

Cheers,

Jeff


Re: [HACKERS] [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-10 Thread Noah Misch
On Mon, Jun 08, 2015 at 03:15:04PM +0200, Andres Freund wrote:
 One more thing:
 Our testing infrastructure sucks. Without writing C code it's basically
 impossible to test wraparounds and such. Even if not particularly useful
 for non-devs, I really think we should have functions for creating
 burning xids/multixacts in core. Or at least in some extension.

+1.  This keeps coming up, so it's worth maintaining a verified and speedy
implementation.




Re: [HACKERS] [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-08 Thread Alvaro Herrera
Robert Haas wrote:
 On Mon, Jun 8, 2015 at 1:23 PM, Alvaro Herrera alvhe...@2ndquadrant.com 
 wrote:

  (My personal alarm bells go off when I see autovac_naptime=15min or
  more, but apparently not everybody sees things that way.)
 
 Uh, I'd echo that sentiment if you did s/15min/1min/

Yeah, well, that too I guess.

 I think Andres's patch is just improving the existing mechanism so
 that it's reliable, and you're proposing something notably different
 which might be better, but which is really a different proposal
 altogether.

Fair enough.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-08 Thread Robert Haas
On Mon, Jun 8, 2015 at 1:23 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote:
 Andres Freund wrote:
 On June 8, 2015 7:06:31 PM GMT+02:00, Alvaro Herrera 
 alvhe...@2ndquadrant.com wrote:
 I might be misreading the code, but PMSIGNAL_START_AUTOVAC_LAUNCHER
 only causes things to happen (i.e. a new worker to be started) when
 autovacuum is disabled.  If autovacuum is enabled, postmaster
 receives the signal and doesn't do anything about it, because the
 launcher is already running.  Of course, regularly scheduled autovac
 workers will eventually start running, but perhaps this is not good
 enough.

 Well that's just the same for the plain xid precedent? I'd not mind
 improving further, but that seems like a separate thing.

 Sure.  I'm just concerned that we might be putting excessive trust in
 emergency workers being launched at a high pace.  With normally
 configured systems (naptime=1min) it shouldn't be a problem, but we have
 seen systems with naptime set to one hour or so, and those might feel
 some pain; and it would get worse the more databases you have, because
 people might feel the need to space the autovac runs even more.

 (My personal alarm bells go off when I see autovac_naptime=15min or
 more, but apparently not everybody sees things that way.)

Uh, I'd echo that sentiment if you did s/15min/1min/

I think Andres's patch is just improving the existing mechanism so
that it's reliable, and you're proposing something notably different
which might be better, but which is really a different proposal
altogether.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-08 Thread Alvaro Herrera
Andres Freund wrote:
 On June 8, 2015 7:06:31 PM GMT+02:00, Alvaro Herrera 
 alvhe...@2ndquadrant.com wrote:

 I might be misreading the code, but PMSIGNAL_START_AUTOVAC_LAUNCHER
 only causes things to happen (i.e. a new worker to be started) when
 autovacuum is disabled.  If autovacuum is enabled, postmaster
 receives the signal and doesn't do anything about it, because the
 launcher is already running.  Of course, regularly scheduled autovac
 workers will eventually start running, but perhaps this is not good
 enough.
 
 Well that's just the same for the plain xid precedent? I'd not mind
 improving further, but that seems like a separate thing.

Sure.  I'm just concerned that we might be putting excessive trust in
emergency workers being launched at a high pace.  With normally
configured systems (naptime=1min) it shouldn't be a problem, but we have
seen systems with naptime set to one hour or so, and those might feel
some pain; and it would get worse the more databases you have, because
people might feel the need to space the autovac runs even more.

(My personal alarm bells go off when I see autovac_naptime=15min or
more, but apparently not everybody sees things that way.)

 --- 
 Please excuse brevity and formatting - I am writing this on my mobile phone.

I wonder if these notices are useful at all.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-08 Thread Andres Freund
On 2015-06-08 14:23:32 -0300, Alvaro Herrera wrote:
 Sure.  I'm just concerned that we might be putting excessive trust in
 emergency workers being launched at a high pace.

I'm not sure what to do about that. I mean, it'd not be hard to simply
ignore naptime upon wraparound, but I'm not sure that'd be well
received.

 (My personal alarm bells go off when I see autovac_naptime=15min or
 more, but apparently not everybody sees things that way.)

Understandably so. I'd be alarmed at much lower values than that
actually.

  --- 
  Please excuse brevity and formatting - I am writing this on my mobile phone.
 
 I wonder if these notices are useful at all.

I only know that I'm less annoyed at reading an untrimmed/badly wrapped
email if it's sent from a mobile phone, where it's hard to impossible to
write a well formatted email, than when sent from a full desktop.
That's why I added the notice...

Andres




Re: [HACKERS] [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-08 Thread Alvaro Herrera
Andres Freund wrote:

 A first version to address this problem can be found appended to this
 email.
 
 Basically it does:
 * Whenever more than MULTIXACT_MEMBER_SAFE_THRESHOLD are used, signal
   autovacuum once per members segment
 * For both members and offsets, once hitting the hard limits, signal
   autovacuum every time. Otherwise we lose the information when
   restarting the database, or when autovac is killed. I ran into this a
   bunch of times while testing.

I might be misreading the code, but PMSIGNAL_START_AUTOVAC_LAUNCHER only
causes things to happen (i.e. a new worker to be started) when
autovacuum is disabled.  If autovacuum is enabled, postmaster receives
the signal and doesn't do anything about it, because the launcher is
already running.  Of course, regularly scheduled autovac workers will
eventually start running, but perhaps this is not good enough.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-08 Thread Andres Freund
On June 8, 2015 7:06:31 PM GMT+02:00, Alvaro Herrera alvhe...@2ndquadrant.com 
wrote:
Andres Freund wrote:

 A first version to address this problem can be found appended to this
 email.
 
 Basically it does:
 * Whenever more than MULTIXACT_MEMBER_SAFE_THRESHOLD are used, signal
   autovacuum once per members segment
 * For both members and offsets, once hitting the hard limits, signal
   autovacuum every time. Otherwise we lose the information when
   restarting the database, or when autovac is killed. I ran into this a
   bunch of times while testing.

I might be misreading the code, but PMSIGNAL_START_AUTOVAC_LAUNCHER only
causes things to happen (i.e. a new worker to be started) when
autovacuum is disabled.  If autovacuum is enabled, postmaster receives
the signal and doesn't do anything about it, because the launcher is
already running.  Of course, regularly scheduled autovac workers will
eventually start running, but perhaps this is not good enough.

Well that's just the same for the plain xid precedent? I'd not mind improving 
further, but that seems like a separate thing.

Andres

--- 
Please excuse brevity and formatting - I am writing this on my mobile phone.




Re: [HACKERS] [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-08 Thread Andres Freund
On 2015-06-05 20:47:33 +0200, Andres Freund wrote:
 On 2015-06-05 14:33:12 -0400, Tom Lane wrote:
  Robert Haas robertmh...@gmail.com writes:
   1. The problem that we might truncate an SLRU members page away when
   it's in the buffers, but not drop it from the buffers, leading to a
   failure when we try to write it later.
 
 I've got a fix for this, and about three other issues I found during
 development of the new truncation codepath.
 
 I'll commit the fix tomorrow.

I've looked through multixact.c/slru.c and afaics there currently is, as
observed by Thomas, no codepath that exercises the broken behaviour. Due
to the way checkpoints and SLRU truncation are linked, problematic
pages will have been flushed beforehand.

I think we should fix this either way as it seems like a bad trap, but
I'd rather commit it after the next minor releases are out.




Re: [HACKERS] [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-08 Thread Andres Freund
On 2015-06-05 16:56:18 -0400, Tom Lane wrote:
 Andres Freund and...@anarazel.de writes:
  On June 5, 2015 10:02:37 PM GMT+02:00, Robert Haas robertmh...@gmail.com 
  wrote:
  I think we would be foolish to rush that part into the tree.  We
  probably got here in the first place by rushing the last round of
  fixes too much; let's try not to double down on that mistake.

  My problem with that approach is that I think the code has gotten 
  significantly more complex in the last few weeks. I have very little trust
  that the interactions between vacuum, the deferred truncations in the 
  checkpointer, the state management in shared memory and recovery are 
  correct. There's just too many non-local subtleties here.

  I don't know what the right thing to do here is.

 My gut feeling is that rushing to make a release date is the wrong thing.

 If we have confidence that we can ship something on Monday that is
 materially more trustworthy than the current releases, then let's aim to
 do that; but let's ship only patches we are confident in.  We can do
 another set of releases later that incorporate additional fixes.  (As some
 wise man once said, there's always another bug.)

I've tortured hardware a fair bit with HEAD. So far it looks much better
than 9.4.2+ et al. I've noticed a bunch of, to me at least, new issues:

1) the autovacuum trigger logic isn't perfect yet. I.e. especially with
  autovacuum=off you can get into situations where emergency vacuums
  aren't started when necessary. This is particularly likely to happen
  if either very large multixacts are used, or if the server has been
  shut down while emergency autovacuums were happening. No corruption
  ensues, but it's not easy to get out of.

2) I've managed to corrupt a cluster when a standby performed
  restartpoints less frequently than the master performed
  checkpoints. Because truncations happen in the checkpointer it's not
  that hard to end up with entirely full multixact slrus. This is a
  problem on several fronts. We can IIUC end up truncating away the
  wrong data, and we can be in a bad state upon promotion.  None of that
  is new.

3) It's really confusing that truncation (and thus the limits in shared
  memory) happens in checkpoints. If you hit a limit and manually do all
  the necessary vacuums you'll see a good limit in
  pg_database.datminmxid, but you'll still run into the error. You manually
  have to force a checkpoint for the truncation to actually
  happen. That's particularly problematic because larger installations,
  where I presume wraparound issues are more likely, often have a large
  checkpoint_timeout setting.

Since none of these are really new, I don't think they should prevent us
from doing a back branch release. While I'm still not convinced we're
better off with 9.4.4 than with 9.4.1, we're certainly better off than
with 9.4.[23] et al.

If we want to go ahead with the release I plan to do a bit more testing
today and tomorrow. If not I'm first going to continue working on fixing
the above.

I've a good fix for 1). I'm not 100% sure I'll feel confident with
pushing if we wrap today. I am wondering if we shouldn't at least apply
the portion that unconditionally sends a signal in the ERROR
case. That's still an improvement.


One more thing:
Our testing infrastructure sucks. Without writing C code it's basically
impossible to test wraparounds and such. Even if not particularly useful
for non-devs, I really think we should have functions for creating
burning xids/multixacts in core. Or at least in some extension.




Re: [HACKERS] [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-08 Thread Andres Freund
On 2015-06-08 15:15:04 +0200, Andres Freund wrote:
 1) the autovacuum trigger logic isn't perfect yet. I.e. especially with
   autovacuum=off you can get into situations where emergency vacuums
   aren't started when necessary. This is particularly likely to happen
   if either very large multixacts are used, or if the server has been
   shut down while emergency autovacuums were happening. No corruption
   ensues, but it's not easy to get out of.

A first version to address this problem can be found appended to this
email.

Basically it does:
* Whenever more than MULTIXACT_MEMBER_SAFE_THRESHOLD are used, signal
  autovacuum once per members segment
* For both members and offsets, once hitting the hard limits, signal
  autovacuum every time. Otherwise we lose the information when
  restarting the database, or when autovac is killed. I ran into this a
  bunch of times while testing.
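
The "once per members segment" rule in the first point can be illustrated roughly as follows. This is a simplified sketch, not the code from the patch: the function names, the caller-supplied threshold, and the members-per-segment parameter are all assumptions made for illustration. It also ignores the case where the segment size does not evenly divide the 32-bit offset space.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the server's 32-bit member offset type. */
typedef uint32_t MultiXactOffset;

/*
 * Hypothetical helper: does allocating nmembers members starting at
 * next_offset spill into a different members segment?
 */
bool
sketch_crosses_segment(MultiXactOffset next_offset, uint32_t nmembers,
                       uint32_t members_per_segment)
{
    assert(members_per_segment > 0);
    assert(nmembers > 0);

    /* Offset of the last member written; unsigned math wraps modulo 2^32. */
    MultiXactOffset last_offset = next_offset + nmembers - 1;

    return (next_offset / members_per_segment) !=
           (last_offset / members_per_segment);
}

/*
 * Hypothetical helper: signal autovacuum only when member usage is above
 * the assumed safe threshold AND this allocation crosses a segment
 * boundary, which limits the signal rate to once per segment.
 */
bool
sketch_should_signal_autovac(MultiXactOffset oldest_offset,
                             MultiXactOffset next_offset,
                             uint32_t nmembers,
                             uint32_t safe_threshold,
                             uint32_t members_per_segment)
{
    MultiXactOffset used = next_offset - oldest_offset;

    if (used <= safe_threshold)
        return false;

    return sketch_crosses_segment(next_offset, nmembers, members_per_segment);
}
```

A simple modulo test, like the once-per-64K check used for offsets, would not work here because each multixact consumes a variable number of members, so the next offset can step over any particular value without ever landing on it.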

Regards,

Andres
From 9949d8ce4b69b4fd693da08d8e1854fd259a33a9 Mon Sep 17 00:00:00 2001
From: Andres Freund and...@anarazel.de
Date: Mon, 8 Jun 2015 13:41:42 +0200
Subject: [PATCH] Improve multixact emergency autovacuum logic.

Previously autovacuum was not necessarily triggered if space in the
members slru got tight. The first problem was that the signalling was
tied to values in the offsets slru, but members can advance much
faster. That's especially a problem if old sessions had been around that
previously prevented the multixact horizon from increasing. Secondly the
skipping logic doesn't work if the database was restarted after
autovacuum was triggered - that knowledge is not preserved across
restart. This is especially a problem because it's a common
panic-reaction to restart the database if it gets slow due to
anti-wraparound vacuums.

Fix the first problem by separating the logic for members from
offsets. Trigger autovacuum whenever a multixact crosses a segment
boundary, as the current member offset increases in irregular values, so
we can't use a simple modulo logic as for offsets.  Add a stopgap for
the second problem, by signalling autovacuum whenever ERRORing out
because of boundaries.

Backpatch into 9.3, where it became more likely that multixacts wrap
around.
---
 src/backend/access/transam/multixact.c | 61 +-
 1 file changed, 45 insertions(+), 16 deletions(-)

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index d3336a8..3bc170d 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -980,10 +980,7 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 	 * Note these are pretty much the same protections in GetNewTransactionId.
 	 *--
 	 */
-	if (!MultiXactIdPrecedes(result, MultiXactState->multiVacLimit) ||
-		!MultiXactState->oldestOffsetKnown ||
-		(MultiXactState->nextOffset - MultiXactState->oldestOffset
-		 > MULTIXACT_MEMBER_SAFE_THRESHOLD))
+	if (!MultiXactIdPrecedes(result, MultiXactState->multiVacLimit))
 	{
 		/*
 		 * For safety's sake, we release MultiXactGenLock while sending
@@ -999,19 +996,17 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 
 		LWLockRelease(MultiXactGenLock);
 
-		/*
-		 * To avoid swamping the postmaster with signals, we issue the autovac
-		 * request only once per 64K multis generated.  This still gives
-		 * plenty of chances before we get into real trouble.
-		 */
-		if (IsUnderPostmaster && (result % 65536) == 0)
-			SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
-
 		if (IsUnderPostmaster &&
 			!MultiXactIdPrecedes(result, multiStopLimit))
 		{
 			char	   *oldest_datname = get_database_name(oldest_datoid);
 
+			/*
+			 * Immediately kick autovacuum into action as we're already
+			 * in ERROR territory.
+			 */
+			SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
+
 			/* complain even if that DB has disappeared */
 			if (oldest_datname)
 				ereport(ERROR,
@@ -1032,6 +1027,14 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 		{
 			char	   *oldest_datname = get_database_name(oldest_datoid);
 
+			/*
+			 * To avoid swamping the postmaster with signals, we issue the autovac
+			 * request only once per 64K multis generated.  This still gives
+			 * plenty of chances before we get into real trouble.
+			 */
+			if (IsUnderPostmaster && (result % 65536) == 0)
+				SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
+
 			/* complain even if that DB has disappeared */
 			if (oldest_datname)
 				ereport(WARNING,
@@ -1099,6 +1102,10 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 	if (MultiXactState->offsetStopLimitKnown &&
 		MultiXactOffsetWouldWrap(MultiXactState->offsetStopLimit, nextOffset,
 								 nmembers))
+	{
+		/* see comment in the corresponding offsets wraparound case */
+		SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
+
 		ereport(ERROR,
 				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
 				 errmsg("multixact \"members\" limit exceeded"),
@@ -1109,10 +1116,32 @@ GetNewMultiXactId(int 

Re: [HACKERS] [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-05 Thread Robert Haas
On Fri, Jun 5, 2015 at 12:00 PM, Andres Freund and...@anarazel.de wrote:
 On 2015-06-05 11:43:45 -0400, Tom Lane wrote:
 Robert Haas robertmh...@gmail.com writes:
  On Fri, Jun 5, 2015 at 2:20 AM, Noah Misch n...@leadboat.com wrote:
  I read through this version and found nothing to change.  I encourage other
  hackers to study the patch, though.  The surrounding code is challenging.

  Andres tested this and discovered that my changes to
  find_multixact_start() were far more creative than intended.
  Committed and back-patched with a trivial fix for that stupidity and a
  novel-length explanation of the changes.

 So where are we on this?  Are we ready to schedule a new set of
 back-branch releases?  If not, what issues remain to be looked at?

 We're currently still doing bad things while the database is in an
 inconsistent state (i.e. read from SLRUs and truncate based on the
 results of that). It's quite easy to reproduce base backup startup
 failures.

 On the other hand, that's not new. And the fix requires, afaics, a new
 type of WAL record (issued very infrequently). I'll post a first version
 of the patch, rebased on top of Robert's commit, tonight or tomorrow. I
 guess we can then decide what we'd like to do.

There are at least two other known issues that seem like they should
be fixed before we release:

1. The problem that we might truncate an SLRU members page away when
it's in the buffers, but not drop it from the buffers, leading to a
failure when we try to write it later.

2. Thomas's bug fix for another longstanding bug that occurs when you
run his checkpoint-segment-boundary.sh script.

I think we might want to try to fix one or both of those before
cutting a new release.  I'm less sold on the idea of installing
WAL-logging in this minor release.  That probably needs to be done,
but right now we've got stuff that worked in early 9.3.X release and
is now broken, and I'm in favor of fixing that first.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-05 Thread Steve Kehlet
On Fri, Jun 5, 2015 at 11:47 AM Andres Freund and...@anarazel.de wrote:

 But I'd definitely like some
 independent testing for it, and I'm not sure if that's doable in time
 for the wrap.


I'd be happy to test on my database that was broken, for however much that
helps. It's a VM so I can easily revert back as needed. I'm just losing
track of all the patches, and what's committed and what I need to manually
apply :-). I was about to test what's on REL9_4_STABLE. Let me know if I
should do this.

Thanks so much everyone.


Re: [HACKERS] [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-05 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Fri, Jun 5, 2015 at 12:00 PM, Andres Freund and...@anarazel.de wrote:
 On 2015-06-05 11:43:45 -0400, Tom Lane wrote:
 So where are we on this?  Are we ready to schedule a new set of
 back-branch releases?  If not, what issues remain to be looked at?

 We're currently still doing bad things while the database is in an
 inconsistent state (i.e. read from SLRUs and truncate based on the
 results of that). It's quite easy to reproduce base backup startup
 failures.
 
 On the other hand, that's not new. And the fix requires, afaics, a new
 type of WAL record (issued very infrequently). I'll post a first version
 of the patch, rebased on top of Robert's commit, tonight or tomorrow. I
 guess we can then decide what we'd like to do.

 There are at least two other known issues that seem like they should
 be fixed before we release:

 1. The problem that we might truncate an SLRU members page away when
 it's in the buffers, but not drop it from the buffers, leading to a
 failure when we try to write it later.

 2. Thomas's bug fix for another longstanding bug that occurs when you
 run his checkpoint-segment-boundary.sh script.

 I think we might want to try to fix one or both of those before
 cutting a new release.  I'm less sold on the idea of installing
 WAL-logging in this minor release.  That probably needs to be done,
 but right now we've got stuff that worked in early 9.3.X release and
 is now broken, and I'm in favor of fixing that first.

Okay, but if we're not committing today to a release wrap on Monday,
I don't see it happening till after PGCon.

regards, tom lane




Re: [HACKERS] [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-05 Thread Alvaro Herrera
Tom Lane wrote:
 Robert Haas robertmh...@gmail.com writes:

  There are at least two other known issues that seem like they should
  be fixed before we release:
 
  1. The problem that we might truncate an SLRU members page away when
  it's in the buffers, but not drop it from the buffers, leading to a
  failure when we try to write it later.
 
  2. Thomas's bug fix for another longstanding bug that occurs when you
  run his checkpoint-segment-boundary.sh script.
 
  I think we might want to try to fix one or both of those before
  cutting a new release.  I'm less sold on the idea of installing
  WAL-logging in this minor release.  That probably needs to be done,
  but right now we've got stuff that worked in early 9.3.X release and
  is now broken, and I'm in favor of fixing that first.
 
 Okay, but if we're not committing today to a release wrap on Monday,
 I don't see it happening till after PGCon.

In that case, I think we should get a release out next week.  The
current situation is rather badly broken and dangerous, and the above
two bugs are nowhere near as problematic.  If we can get fixes for these
over the weekend, that would be an additional bonus.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-05 Thread Andres Freund
On 2015-06-05 14:33:12 -0400, Tom Lane wrote:
 Robert Haas robertmh...@gmail.com writes:
  1. The problem that we might truncate an SLRU members page away when
  it's in the buffers, but not drop it from the buffers, leading to a
  failure when we try to write it later.

I've got a fix for this, and about three other issues I found during
development of the new truncation codepath.

I'll commit the fix tomorrow.

  I think we might want to try to fix one or both of those before
  cutting a new release.  I'm less sold on the idea of installing
  WAL-logging in this minor release.  That probably needs to be done,
  but right now we've got stuff that worked in early 9.3.X release and
  is now broken, and I'm in favor of fixing that first.

I've implemented this, and so far it removes more code than it
adds. It's imo also a pretty clear win in how understandable the code
is.  The remaining work, besides testing, is primarily going over lots
of comments and updating them. Some of them are outdated by the patch,
and some already were.

Will post tonight, together with the other fixes, after I get back from
climbing.

My gut feeling right now is that it's a significant improvement, and
that it'll be reasonable to include it. But I'd definitely like some
independent testing for it, and I'm not sure if that's doable in time
for the wrap.

 Okay, but if we're not committing today to a release wrap on Monday,
 I don't see it happening till after PGCon.

I wonder if, with all the recent, err, training, we could wrap it on
Tuesday instead. Independent of the truncation rework going in or not,
an additional work day to go over all the changes and do some more
testing would be good from my POV.

Greetings,

Andres Freund




Re: [HACKERS] [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-05 Thread Andres Freund
On 2015-06-05 11:43:45 -0400, Tom Lane wrote:
 Robert Haas robertmh...@gmail.com writes:
  On Fri, Jun 5, 2015 at 2:20 AM, Noah Misch n...@leadboat.com wrote:
  I read through this version and found nothing to change.  I encourage other
  hackers to study the patch, though.  The surrounding code is challenging.
 
  Andres tested this and discovered that my changes to
  find_multixact_start() were far more creative than intended.
  Committed and back-patched with a trivial fix for that stupidity and a
  novel-length explanation of the changes.
 
 So where are we on this?  Are we ready to schedule a new set of
 back-branch releases?  If not, what issues remain to be looked at?

We're currently still doing bad things while the database is in an
inconsistent state (i.e. read from SLRUs and truncate based on the
results of that). It's quite easy to reproduce base backup startup
failures.

On the other hand, that's not new. And the fix requires, afaics, a new
type of WAL record (issued very infrequently). I'll post a first version
of the patch, rebased on top of Robert's commit, tonight or tomorrow. I
guess we can then decide what we'd like to do.




Re: [HACKERS] [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-05 Thread Joshua D. Drake


On 06/05/2015 01:56 PM, Tom Lane wrote:


If we have confidence that we can ship something on Monday that is
materially more trustworthy than the current releases, then let's aim to
do that; but let's ship only patches we are confident in.  We can do
another set of releases later that incorporate additional fixes.  (As some
wise man once said, there's always another bug.)

If what you're saying is that you don't trust the already-committed patch
very much, then maybe we'd better hold off another couple weeks for more
review and testing.

regards, tom lane



I believe there are likely quite a few parties willing to help test, if 
we knew how?


Sincerely,

jD


--
Command Prompt, Inc. - http://www.commandprompt.com/  503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Announcing I'm offended is basically telling the world you can't
control your own emotions, so everyone else should do it for you.




Re: [HACKERS] [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-05 Thread Robert Haas
On Fri, Jun 5, 2015 at 2:36 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote:
 Tom Lane wrote:
 Robert Haas robertmh...@gmail.com writes:

  There are at least two other known issues that seem like they should
  be fixed before we release:

  1. The problem that we might truncate an SLRU members page away when
  it's in the buffers, but not drop it from the buffers, leading to a
  failure when we try to write it later.

  2. Thomas's bug fix for another longstanding bug that occurs when you
  run his checkpoint-segment-boundary.sh script.

  I think we might want to try to fix one or both of those before
  cutting a new release.  I'm less sold on the idea of installing
  WAL-logging in this minor release.  That probably needs to be done,
  but right now we've got stuff that worked in early 9.3.X release and
  is now broken, and I'm in favor of fixing that first.

 Okay, but if we're not committing today to a release wrap on Monday,
 I don't see it happening till after PGCon.

 In that case, I think we should get a release out next week.  The
 current situation is rather badly broken and dangerous, and the above
 two bugs are nowhere near as problematic.  If we can get fixes for these
 over the weekend, that would be an additional bonus.

Yeah, I think I agree.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-05 Thread Tom Lane
Andres Freund and...@anarazel.de writes:
 On June 5, 2015 10:02:37 PM GMT+02:00, Robert Haas robertmh...@gmail.com wrote:
 I think we would be foolish to rush that part into the tree.  We
 probably got here in the first place by rushing the last round of
 fixes too much; let's try not to double down on that mistake.

 My problem with that approach is that I think the code has gotten
 significantly more complex in the last few weeks. I have very little trust
 that the interactions between vacuum, the deferred truncations in the
 checkpointer, the state management in shared memory and recovery are correct.
 There are just too many non-local subtleties here.

 I don't know what the right thing to do here is.

My gut feeling is that rushing to make a release date is the wrong thing.

If we have confidence that we can ship something on Monday that is
materially more trustworthy than the current releases, then let's aim to
do that; but let's ship only patches we are confident in.  We can do
another set of releases later that incorporate additional fixes.  (As some
wise man once said, there's always another bug.)

If what you're saying is that you don't trust the already-committed patch
very much, then maybe we'd better hold off another couple weeks for more
review and testing.

regards, tom lane




Re: [HACKERS] [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-05 Thread Alvaro Herrera
Joshua D. Drake wrote:

 I believe there are likely quite a few parties willing to help test, if we
 knew how?

The code involved is related to checkpoints, pg_basebackups that take a
long time to run, and multixact freezing and truncation.  If you can set
up test servers that eat lots of multixacts(*), then have many multixact
freezes and truncations occur, that would probably hit the right spots.
(You can set very frequent freezing by lowering
vacuum_multixact_freeze_min_age and vacuum_multixact_freeze_table_age
settings.  Perhaps changing autovacuum_multixact_freeze_max_age would lead
to other interesting results too.  Truncation occurs during checkpoint, some
time after freezing, so it's probably good that those are frequent too.)
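
As a rough sketch of the settings mentioned above, the snippet below renders
a postgresql.conf fragment for a throwaway test server.  The values are
assumptions picked for illustration, not numbers proposed in this thread;
the max-age setting is spelled with its full GUC name,
autovacuum_multixact_freeze_max_age, and checkpoint_timeout is included only
because truncation happens at checkpoint time.

```python
# Illustrative values only (assumptions, not taken from the thread).
# Intended for a disposable test server, never for production.
AGGRESSIVE_FREEZE_SETTINGS = {
    "vacuum_multixact_freeze_min_age": "0",
    "vacuum_multixact_freeze_table_age": "0",
    "autovacuum_multixact_freeze_max_age": "10000",
    "checkpoint_timeout": "30s",
}


def render_conf(settings):
    """Return postgresql.conf text with one 'name = value' line per setting."""
    return "\n".join(f"{name} = {value}" for name, value in settings.items())


if __name__ == "__main__":
    print(render_conf(AGGRESSIVE_FREEZE_SETTINGS))
```

The output can be appended to the test server's postgresql.conf, followed by
a reload.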

Also, pg_upgrade prior to 9.3.4 is able to produce a database with
invalid oldestMulti=1, if you start from a 9.2-or-earlier database that
has already consumed some number of multis.  It would be good to test
starting from those, too, just to make sure the mechanism that deals
with that is good.  There are at least two variations: those that have
nextMulti larger than 65k but less than 2 billion, and those that have
nextMulti closer to 4 billion.  (I think a 9.2 database with nextMulti
less than 65k is uninteresting, because the resulting oldestMulti=1 is
the correct value there.)
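
To keep track of which starting point a given test cluster covers, the
variations above could be labelled with something like the following.  This
is only a sketch: the exact boundaries (65536 for "65k", 2^31 for "2
billion") and the label names are assumptions made for the example.

```python
# Boundaries are assumptions: "65k" is read as 65536 and "2 billion" as 2**31.
SMALL_LIMIT = 65536
HALFWAY = 2**31
MAX_MULTI = 2**32 - 1  # multixact IDs are 32-bit values


def classify_next_multi(next_multi):
    """Label a pre-upgrade nextMulti value by the test variation it covers."""
    if not 0 <= next_multi <= MAX_MULTI:
        raise ValueError("nextMulti must fit in 32 bits")
    if next_multi < SMALL_LIMIT:
        return "uninteresting"    # oldestMulti=1 is correct here
    if next_multi < HALFWAY:
        return "below-2-billion"  # larger than 65k, less than 2 billion
    return "near-4-billion"       # upper half of the ID space
```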

(*) Thomas Munro posted a sample program that does that; I believe with
minimal changes you could turn it into infinite looping instead of a
pre-set number of iterations.  Also, perhaps it's possible to come up
with programs that consume multixacts even faster than that, and that
create larger multixacts too.  All variations are useful.
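
For anyone who wants to write their own consumer, here is a minimal sketch
of the idea.  It is not Thomas Munro's program.  It assumes two DB-API
connections (for example from psycopg2.connect) with autocommit off, and a
hypothetical table mxtest that contains a row with id = 1.  Two open
transactions share-lock the same row in each round, which should allocate
one new two-member multixact per round.

```python
import itertools

# Hypothetical table and row; create them before running the loop.
LOCK_SQL = "SELECT id FROM mxtest WHERE id = 1 FOR SHARE"


def consume_multixacts(conn_a, conn_b, rounds=None):
    """Share-lock one row from two open transactions, once per round.

    conn_a and conn_b are DB-API connections with autocommit off, so each
    execute() implicitly opens a transaction.  rounds=None loops forever.
    Returns the number of rounds completed.
    """
    cur_a, cur_b = conn_a.cursor(), conn_b.cursor()
    counter = itertools.count() if rounds is None else range(rounds)
    done = 0
    for _ in counter:
        cur_a.execute(LOCK_SQL)  # first locker: plain xid in xmax
        cur_b.execute(LOCK_SQL)  # second locker: forces a multixact
        conn_a.commit()
        conn_b.commit()
        done += 1
    return done
```

Running several pairs of connections against different rows, or adding more
lockers per row, would be one way to get faster consumption and larger
multixacts.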

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-05 Thread Robert Haas
On Fri, Jun 5, 2015 at 2:47 PM, Andres Freund and...@anarazel.de wrote:
 On 2015-06-05 14:33:12 -0400, Tom Lane wrote:
 Robert Haas robertmh...@gmail.com writes:
  1. The problem that we might truncate an SLRU members page away when
  it's in the buffers, but not drop it from the buffers, leading to a
  failure when we try to write it later.

 I've got a fix for this, and about three other issues I found during
 development of the new truncation codepath.

 I'll commit the fix tomorrow.

OK.  Then I think we should release next week, so we get the fixes we
have out before PGCon.  The current situation is not good.

  I think we might want to try to fix one or both of those before
  cutting a new release.  I'm less sold on the idea of installing
  WAL-logging in this minor release.  That probably needs to be done,
  but right now we've got stuff that worked in early 9.3.X release and
  is now broken, and I'm in favor of fixing that first.

 I've implemented this, and so far it removes more code than it
 adds. It's imo also a pretty clear win in how understandable the code
 is.  The remaining work, besides testing, is primarily going over lots
 of comments and updating them. Some of them are outdated by the patch,
 and some already were.

 Will post tonight, together with the other fixes, after I get back from
 climbing.

 My gut feeling right now is that it's a significant improvement, and
 that it'll be reasonable to include it. But I'd definitely like some
 independent testing for it, and I'm not sure if that's doable in time
 for the wrap.

I think we would be foolish to rush that part into the tree.  We
probably got here in the first place by rushing the last round of
fixes too much; let's try not to double down on that mistake.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-05 Thread Andres Freund
On June 5, 2015 10:02:37 PM GMT+02:00, Robert Haas robertmh...@gmail.com wrote:
On Fri, Jun 5, 2015 at 2:47 PM, Andres Freund and...@anarazel.de wrote:
 On 2015-06-05 14:33:12 -0400, Tom Lane wrote:
 Robert Haas robertmh...@gmail.com writes:
  1. The problem that we might truncate an SLRU members page away when
  it's in the buffers, but not drop it from the buffers, leading to a
  failure when we try to write it later.

 I've got a fix for this, and about three other issues I found during
 development of the new truncation codepath.

 I'll commit the fix tomorrow.

OK.  Then I think we should release next week, so we get the fixes we
have out before PGCon.  The current situation is not good.

  I think we might want to try to fix one or both of those before
  cutting a new release.  I'm less sold on the idea of installing
  WAL-logging in this minor release.  That probably needs to be done,
  but right now we've got stuff that worked in early 9.3.X release and
  is now broken, and I'm in favor of fixing that first.

 I've implemented this, and so far it removes more code than it
 adds. It's imo also a pretty clear win in how understandable the code
 is.  The remaining work, besides testing, is primarily going over lots
 of comments and updating them. Some of them are outdated by the patch,
 and some already were.

 Will post tonight, together with the other fixes, after I get back from
 climbing.

 My gut feeling right now is that it's a significant improvement, and
 that it'll be reasonable to include it. But I'd definitely like some
 independent testing for it, and I'm not sure if that's doable in time
 for the wrap.

I think we would be foolish to rush that part into the tree.  We
probably got here in the first place by rushing the last round of
fixes too much; let's try not to double down on that mistake.

My problem with that approach is that I think the code has gotten significantly
more complex in the last few weeks. I have very little trust that the
interactions between vacuum, the deferred truncations in the checkpointer, the
state management in shared memory and recovery are correct. There are just too
many non-local subtleties here.

I don't know what the right thing to do here is.



--- 
Please excuse brevity and formatting - I am writing this on my mobile phone.




Re: [HACKERS] [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-05 Thread Robert Haas
On Fri, Jun 5, 2015 at 4:40 PM, Andres Freund and...@anarazel.de wrote:
I think we would be foolish to rush that part into the tree.  We
probably got here in the first place by rushing the last round of
fixes too much; let's try not to double down on that mistake.

 My problem with that approach is that I think the code has gotten
 significantly more complex in the last few weeks. I have very little trust
 that the interactions between vacuum, the deferred truncations in the
 checkpointer, the state management in shared memory and recovery are correct.
 There are just too many non-local subtleties here.

 I don't know what the right thing to do here is.

That may be true, but we don't need to get to perfect to be better
than 9.4.2 and 9.4.3, where some people can't start the database.

I will grant you that, if the patch I committed today introduces some
regression that is even worse, life will suck.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

