Re: [HACKERS] [PATCHES] WAL logging freezing

2006-11-01 Thread Simon Riggs
On Tue, 2006-10-31 at 11:04 -0500, Tom Lane wrote:

 It seems that we're converging on the conclusion that not truncating
 clog early is the least bad alternative.  This has the advantage of
 making things a lot simpler --- we won't need to track minxid at all.
 Allow me to summarize what I think has to happen:
 
 * VACUUM will determine a freeze cutoff XID the same way it does now,
 except that instead of using a hard-wired freeze window of 1G
 transactions, we'll either reduce the window to (say) 100M transactions
 or provide a GUC variable that can be adjusted over some reasonable
 range.
 
 * All XIDs present in the table that are older than the cutoff XID will
 be replaced by FrozenXid or InvalidXid as required, and such actions
 will be WAL-logged.  (I think we need to consider all 3 of xmin, xmax,
 and xvac here.)
 
 * On successful completion, the cutoff XID is stored in
 pg_class.relvacuumxid, and pg_database.datvacuumxid is updated
 if appropriate.  (The minxid columns are now useless, but unless there
 is another reason to force initdb before 8.2, I'm inclined to leave them
 there and unused.  We can remove 'em in 8.3.)
 
 * pg_clog is truncated according to the oldest pg_database.datvacuumxid.
 We should WAL-log this action, because replaying such an entry will
 allow a PITR slave to truncate its own clog and thereby avoid wraparound
 conflicts.  Note that we no longer need a checkpoint before truncating
 --- what we need is XLogFlush, instead.  (WAL before data)
 
 These changes get us back into the regime where the hint bits truly are
 hints, because the underlying pg_clog data is still there, both in a
 master database and in a PITR slave.  So we don't need to worry about
 WAL-logging hint bits.  We also avoid needing any flushes/fsyncs or
 extra checkpoints.  The added WAL volume should be pretty minimal,
 because only tuples that have gone untouched for a long time incur extra
 work.  The added storage space for pg_clog could be annoying for a small
 database, but reducing the freeze window ameliorates that objection.

Complete agreement, nice summary.

Do we need another GUC? I thought your observation about a PITR slave
having that set lower than its master remains unresolved. Perhaps
we should do that via pg_config_manual.h, so it's much less likely that
the two would be out of step?

Thanks to Heikki for spotting the original bug before it was reported
in production.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com





Re: [HACKERS] [PATCHES] WAL logging freezing

2006-11-01 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 Do we need another GUC? I thought your observation about a PITR slave
 having that set lower than its master remains unresolved.

No, AFAICS that's not an issue in this design.  The facts-on-the-ground
are whatever is recorded in pg_class.relvacuumxid, and whether a
particular table has been vacuumed with a shorter or longer freeze
window doesn't affect correctness.  In particular, a slave with
ambitions towards having a shorter window would still be unable to
truncate its clog before having re-vacuumed everything.  

So, not only could we have a GUC variable, but it could be USERSET;
there's no breakage risk as long as we constrain the value range to
something sane.

It strikes me that VACUUM FREEZE could be replaced by
SET vacuum_freeze_limit = 0
VACUUM ...
which would be a good thing because the FREEZE keyword has to be
partially reserved in this syntax, and that is contrary to spec.
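
For illustration, a sketch of how such an entry might look in guc.c's
ConfigureNamesInt[], following 8.2-era conventions. The name
vacuum_freeze_limit comes from the message above, and the default (100M)
and range (0..1G) are the figures discussed in this thread, not settings
from any released PostgreSQL:

/*
 * Hypothetical guc.c entry (sketch only).  PGC_USERSET is safe because
 * correctness depends only on what ends up in pg_class.relvacuumxid,
 * not on the window any particular VACUUM ran with; the min/max fields
 * constrain the value range to something sane.
 */
{
    {"vacuum_freeze_limit", PGC_USERSET, CLIENT_CONN_STATEMENT,
        gettext_noop("Age (in transactions) at which VACUUM freezes old tuples."),
        NULL
    },
    &vacuum_freeze_limit,
    100000000, 0, 1000000000, NULL, NULL
},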

regards, tom lane



Re: [HACKERS] [PATCHES] WAL logging freezing

2006-10-31 Thread Heikki Linnakangas

Tom Lane wrote:

The only alternative I can see is the one Heikki suggested: don't
truncate clog until the freeze horizon.  That's safe (given the planned
change to WAL-log tuple freezing) and clean and simple, but a permanent
requirement of 250MB+ for pg_clog would put the final nail in the coffin
of PG's usability in small-disk-footprint environments.  So I don't like
it much.  I suppose it could be made more tolerable by reducing the
freeze horizon, say to 100M instead of 1G transactions.  Anyone for a
GUC parameter?  In a high-volume DB you'd want the larger setting to
minimize the amount of tuple freezing work.  OTOH it seems like making
this configurable creates a nasty risk for PITR situations: a slave
that's configured with a smaller freeze window than the master is
probably not safe.


If we go down that route, we really should make it a GUC parameter, and 
reduce the default at least for 8_1_STABLE.


I got another idea. If we make sure that vacuum removes any aborted xid 
older than OldestXmin from the table, we can safely assume that any xid 
< the current clog truncation point we are going to be interested in is 
committed. Vacuum already removes any tuple with an aborted xmin. If we 
also set any aborted xmax (and xvac) to InvalidXid, and WAL logged that, 
we would know that after vacuum commits, any xid < relvacuumxid in the 
vacuumed table was committed, regardless of the hint bits. We could then 
safely truncate the clog without flushing anything. This also seems safe 
for PITR.


The only performance hit would be the clearing of xmax values of aborted 
transactions, but that doesn't seem too bad to me because most 
transactions commit.
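
A sketch of the rule Heikki describes (the function itself is
hypothetical, not code from this thread; TransactionIdIsNormal,
TransactionIdPrecedes and TransactionIdDidAbort are the ordinary
transam primitives):

#include "access/htup.h"
#include "access/transam.h"

/*
 * Sketch: reset an aborted, sufficiently old xmax to InvalidTransactionId.
 * Once vacuum has done this everywhere (and removed tuples with aborted
 * xmin), and the changes are WAL-logged, every xid below relvacuumxid
 * still found in the table must be committed, so clog below that point
 * can be discarded without flushing hint bits first.
 */
static bool
clear_aborted_xmax(HeapTupleHeader tuple, TransactionId OldestXmin)
{
    TransactionId xmax = HeapTupleHeaderGetXmax(tuple);

    if (TransactionIdIsNormal(xmax) &&
        TransactionIdPrecedes(xmax, OldestXmin) &&
        TransactionIdDidAbort(xmax))
    {
        HeapTupleHeaderSetXmax(tuple, InvalidTransactionId);
        return true;            /* caller must WAL-log this change */
    }
    return false;
}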


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] [PATCHES] WAL logging freezing

2006-10-31 Thread Tom Lane
Heikki Linnakangas [EMAIL PROTECTED] writes:
 I got another idea. If we make sure that vacuum removes any aborted xid 
 older than OldestXmin from the table, we can safely assume that any xid 
 < the current clog truncation point we are going to be interested in is 
 committed. Vacuum already removes any tuple with an aborted xmin. If we 
 also set any aborted xmax (and xvac) to InvalidXid, and WAL logged that, 

The problem with that is all the extra WAL log volume it creates.  I'm
also concerned about the loss of forensic information --- xmax values
are frequently useful for inferring what's been going on in a database.
(This is another reason for not wanting a very short freeze interval BTW.)

regards, tom lane



Re: [HACKERS] [PATCHES] WAL logging freezing

2006-10-31 Thread Tom Lane
Alvaro Herrera [EMAIL PROTECTED] writes:
 Simon Riggs wrote:
 Ouch! We did discuss that also. Flushing the buffer cache is nasty with
 very large caches, so this makes autovacuum much less friendly - and
 could take a seriously long time if you enforce the vacuum delay
 costings.

 Hmm, isn't the buffer cache aware of a vacuum operation?

Yeah.  What would probably happen is that we'd dump off most of the
dirtied pages to the kernel, which would likely still have a lot of them
in kernel buffers pending write.  But then we'd have to fsync the table
--- so a physical write storm would ensue, which we have no way to
throttle.

I think the don't-truncate-clog approach is a much better answer.

regards, tom lane



Re: [HACKERS] [PATCHES] WAL logging freezing

2006-10-31 Thread Tom Lane
It seems that we're converging on the conclusion that not truncating
clog early is the least bad alternative.  This has the advantage of
making things a lot simpler --- we won't need to track minxid at all.
Allow me to summarize what I think has to happen:

* VACUUM will determine a freeze cutoff XID the same way it does now,
except that instead of using a hard-wired freeze window of 1G
transactions, we'll either reduce the window to (say) 100M transactions
or provide a GUC variable that can be adjusted over some reasonable
range.

* All XIDs present in the table that are older than the cutoff XID will
be replaced by FrozenXid or InvalidXid as required, and such actions
will be WAL-logged.  (I think we need to consider all 3 of xmin, xmax,
and xvac here.)

* On successful completion, the cutoff XID is stored in
pg_class.relvacuumxid, and pg_database.datvacuumxid is updated
if appropriate.  (The minxid columns are now useless, but unless there
is another reason to force initdb before 8.2, I'm inclined to leave them
there and unused.  We can remove 'em in 8.3.)

* pg_clog is truncated according to the oldest pg_database.datvacuumxid.
We should WAL-log this action, because replaying such an entry will
allow a PITR slave to truncate its own clog and thereby avoid wraparound
conflicts.  Note that we no longer need a checkpoint before truncating
--- what we need is XLogFlush, instead.  (WAL before data)

These changes get us back into the regime where the hint bits truly are
hints, because the underlying pg_clog data is still there, both in a
master database and in a PITR slave.  So we don't need to worry about
WAL-logging hint bits.  We also avoid needing any flushes/fsyncs or
extra checkpoints.  The added WAL volume should be pretty minimal,
because only tuples that have gone untouched for a long time incur extra
work.  The added storage space for pg_clog could be annoying for a small
database, but reducing the freeze window ameliorates that objection.
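
In sketch form, the ordering the last point requires (CLOG_TRUNCATE is
the record type introduced by Simon's prototype patch later in this
thread; only the ordering matters here):

    /* inside a TruncateCLOG(int cutoffPage)-style routine */
    XLogRecData rdata;
    XLogRecPtr  recptr;

    rdata.data = (char *) &cutoffPage;
    rdata.len = sizeof(int);
    rdata.buffer = InvalidBuffer;
    rdata.next = NULL;

    /* WAL before data: log and flush the truncation record first... */
    recptr = XLogInsert(RM_CLOG_ID, CLOG_TRUNCATE | XLOG_NO_TRAN, &rdata);
    XLogFlush(recptr);

    /* ...and only then unlink the old clog segments.  No checkpoint. */
    SimpleLruTruncate(ClogCtl, cutoffPage);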

Comments?  Anyone see any remaining holes?

regards, tom lane



Re: [HACKERS] [PATCHES] WAL logging freezing

2006-10-31 Thread Simon Riggs
On Mon, 2006-10-30 at 20:40 -0500, Tom Lane wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
  That was understood; in the above example I agree you need to flush. If
  you don't pass a truncation point, you don't need to flush whether or
  not you actually truncate. So we don't need to flush *every* time,
 
 OK, but does that actually do much of anything for your performance
 complaint?  Just after GlobalXmin has passed a truncation point, *every*
 vacuum the system does will start performing a flush-n-fsync, which
 seems like exactly what you didn't like.  If the syncs were spread out
 in time for different rels then maybe this idea would help, but AFAICS
 they won't be.

Makes sense, so we shouldn't do it that way after all.

Are you OK with the other patches I've submitted? My understanding was
that you're gonna have a look at those and this general area? I don't
want to hold up the release because of a PITR patch.

Feedback welcome ;-)

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com





Re: [HACKERS] [PATCHES] WAL logging freezing

2006-10-31 Thread Gregory Stark

Tom Lane [EMAIL PROTECTED] writes:

 The added WAL volume should be pretty minimal, because only tuples that have
 gone untouched for a long time incur extra work.

That seems like a weak point in the logic. It seems like it would make VACUUM,
which is already an I/O hog, even more so. Perhaps something clever can be done
with vacuum_cost_delay and commit_siblings.

Something like inserting the delay between WAL logging and syncing the log and
writing to the heap. So if another transaction commits in the meantime we can
skip the extra fsync and continue.


-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com



Re: [HACKERS] [PATCHES] WAL logging freezing

2006-10-31 Thread Alvaro Herrera
Gregory Stark wrote:
 
 Tom Lane [EMAIL PROTECTED] writes:
 
  The added WAL volume should be pretty minimal, because only tuples that have
  gone untouched for a long time incur extra work.
 
 That seems like a weak point in the logic. It seems like it would make VACUUM,
 which is already an I/O hog, even more so. Perhaps something clever can be done
 with vacuum_cost_delay and commit_siblings.
 
 Something like inserting the delay between WAL logging and syncing the log and
 writing to the heap. So if another transaction commits in the meantime we can
 skip the extra fsync and continue.

Huh, but the log would not be flushed for each operation that the vacuum
logs.  Only when it's going to commit.

-- 
Alvaro Herrera                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] [PATCHES] WAL logging freezing

2006-10-31 Thread Tom Lane
Alvaro Herrera [EMAIL PROTECTED] writes:
 Huh, but the log would not be flushed for each operation that the vacuum
 logs.  Only when it's going to commit.

It strikes me that the vacuum cost delay feature omits to consider
generation of WAL records as a cost factor.  It may not be a big problem
though, as long as we can limit the number of records created to one or
two per page --- then you can see it as just a component of the "dirtied
a page" cost.  If we made a separate WAL record for each tuple then it
could be important to account for.
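
Concretely, the accounting could be as small as this; VacuumCostActive
and VacuumCostBalance are the existing cost-delay state, while
VacuumCostPageWAL is an invented knob shown only to mark where the
charge would go:

    /* hypothetical: charge each freeze WAL record to the cost budget */
    if (VacuumCostActive)
        VacuumCostBalance += VacuumCostPageWAL;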

regards, tom lane



Re: [HACKERS] [PATCHES] WAL logging freezing

2006-10-30 Thread Simon Riggs
On Mon, 2006-10-30 at 12:05 -0500, Tom Lane wrote:
 Alvaro Herrera [EMAIL PROTECTED] writes:
  Ugh.  Is there another solution to this?  Say, sync the buffer so that
  the hint bits are written to disk?
 
 Yeah.  The original design for all this is explained by the notes for
 TruncateCLOG:
 
  * When this is called, we know that the database logically contains no
  * reference to transaction IDs older than oldestXact.  However, we must
  * not truncate the CLOG until we have performed a checkpoint, to ensure
  * that no such references remain on disk either; else a crash just after
  * the truncation might leave us with a problem.
 
 The pre-8.2 coding is actually perfectly safe within a single database,
 because TruncateCLOG is only called at the end of a database-wide
 vacuum, and so the checkpoint is guaranteed to have flushed valid hint
 bits for all tuples to disk.  There is a risk in other databases though.
 I think that in the 8.2 structure the equivalent notion must be that
 VACUUM has to flush and fsync a table before it can advance the table's
 relminxid.

Ouch! We did discuss that also. Flushing the buffer cache is nasty with
very large caches, so this makes autovacuum much less friendly - and
could take a seriously long time if you enforce the vacuum delay
costings.

ISTM we only need to flush iff the clog would be truncated when we
update relminxid. Otherwise we are safe to update even if we crash,
since the clog will not have been truncated. 

 That still leaves us with the problem of hint bits not being updated
 during WAL replay.  I think the best solution for this is for WAL replay
 to force relvacuumxid to equal relminxid (btw, these field names seem
 poorly chosen, and the comment in catalogs.sgml isn't self-explanatory...)
 rather than adopting the value shown in the WAL record.  This probably
is best done by abandoning the generic "overwrite tuple" WAL record type
 in favor of something specific to minxid updates.  The effect would then
 be that a PITR slave would not truncate its clog beyond the freeze
 horizon until it had performed a vacuum of its own.

Sounds good. Methinks we do still need the TruncateCLOG patch to ensure
we do WAL replay for the truncation? I'm posting that now to -patches as
a prototype.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com

Index: src/backend/access/transam/clog.c
===
RCS file: /projects/cvsroot/pgsql/src/backend/access/transam/clog.c,v
retrieving revision 1.40
diff -c -r1.40 clog.c
*** src/backend/access/transam/clog.c	4 Oct 2006 00:29:49 -	1.40
--- src/backend/access/transam/clog.c	30 Oct 2006 14:32:14 -
***
*** 68,74 
  
  static int	ZeroCLOGPage(int pageno, bool writeXlog);
  static bool CLOGPagePrecedes(int page1, int page2);
! static void WriteZeroPageXlogRec(int pageno);
  
  
  /*
--- 68,74 
  
  static int	ZeroCLOGPage(int pageno, bool writeXlog);
  static bool CLOGPagePrecedes(int page1, int page2);
! static void WriteClogXlogRec(int pageno, int rectype);
  
  
  /*
***
*** 198,204 
  	slotno = SimpleLruZeroPage(ClogCtl, pageno);
  
  	if (writeXlog)
! 		WriteZeroPageXlogRec(pageno);
  
  	return slotno;
  }
--- 198,204 
  	slotno = SimpleLruZeroPage(ClogCtl, pageno);
  
  	if (writeXlog)
! 		WriteClogXlogRec(pageno, CLOG_ZEROPAGE);
  
  	return slotno;
  }
***
*** 338,343 
--- 338,345 
  	/* Perform a CHECKPOINT */
  	RequestCheckpoint(true, false);
  
+ 	WriteClogXlogRec(cutoffPage, CLOG_TRUNCATE);
+ 
  	/* Now we can remove the old CLOG segment(s) */
  	SimpleLruTruncate(ClogCtl, cutoffPage);
  }
***
*** 375,389 
   * (Besides which, this is normally done just before entering a transaction.)
   */
  static void
! WriteZeroPageXlogRec(int pageno)
  {
  	XLogRecData rdata;
  
  	rdata.data = (char *) (&pageno);
  	rdata.len = sizeof(int);
  	rdata.buffer = InvalidBuffer;
  	rdata.next = NULL;
! 	(void) XLogInsert(RM_CLOG_ID, CLOG_ZEROPAGE | XLOG_NO_TRAN, &rdata);
  }
  
  /*
--- 377,393 
   * (Besides which, this is normally done just before entering a transaction.)
   */
  static void
! WriteClogXlogRec(int pageno, int rectype)
  {
  	XLogRecData rdata;
  
+ Assert(rectype == CLOG_ZEROPAGE || rectype == CLOG_TRUNCATE);
+ 
  	rdata.data = (char *) (&pageno);
  	rdata.len = sizeof(int);
  	rdata.buffer = InvalidBuffer;
  	rdata.next = NULL;
! 	(void) XLogInsert(RM_CLOG_ID, rectype | XLOG_NO_TRAN, &rdata);
  }
  
  /*
***
*** 409,414 
--- 413,432 
  
  		LWLockRelease(CLogControlLock);
  	}
+ else if (info == CLOG_TRUNCATE)
+ {
+ 		int			pageno;
+ 
+ 		memcpy(&pageno, XLogRecGetData(record), sizeof(int));
+ 
+ 		LWLockAcquire(CLogControlLock, LW_EXCLUSIVE);
+ 
+ 	SimpleLruTruncate(ClogCtl, pageno);
+ 
+ 		LWLockRelease(CLogControlLock);
+ }
+ else
+ 		elog(PANIC, "clog_redo: unknown op code %u", info);
  }
  
  void

Re: [HACKERS] [PATCHES] WAL logging freezing

2006-10-30 Thread Alvaro Herrera
Simon Riggs wrote:
 On Mon, 2006-10-30 at 12:05 -0500, Tom Lane wrote:
  Alvaro Herrera [EMAIL PROTECTED] writes:
   Ugh.  Is there another solution to this?  Say, sync the buffer so that
   the hint bits are written to disk?
  
  Yeah.  The original design for all this is explained by the notes for
  TruncateCLOG:
  
   * When this is called, we know that the database logically contains no
   * reference to transaction IDs older than oldestXact.  However, we must
   * not truncate the CLOG until we have performed a checkpoint, to ensure
   * that no such references remain on disk either; else a crash just after
   * the truncation might leave us with a problem.
  
  The pre-8.2 coding is actually perfectly safe within a single database,
  because TruncateCLOG is only called at the end of a database-wide
  vacuum, and so the checkpoint is guaranteed to have flushed valid hint
  bits for all tuples to disk.  There is a risk in other databases though.
  I think that in the 8.2 structure the equivalent notion must be that
  VACUUM has to flush and fsync a table before it can advance the table's
  relminxid.
 
  Ouch! We did discuss that also. Flushing the buffer cache is nasty with
 very large caches, so this makes autovacuum much less friendly - and
 could take a seriously long time if you enforce the vacuum delay
 costings.
 
 ISTM we only need to flush iff the clog would be truncated when we
 update relminxid. Otherwise we are safe to update even if we crash,
 since the clog will not have been truncated. 

I don't understand.  When clog is actually going to be truncated, if
it's determined that there's any page that can be truncated, then a
checkpoint is forced.  If no page is going to be removed then there's no
checkpoint, which makes a lot of sense and of course avoids the problem
of useless flushes.

In fact I don't understand the point about multiple databases vs.
a single database.  Surely a checkpoint would flush all buffers in all
databases, no?  This would flush all hint bits, everywhere.  So this bug
does not really exist.

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] [PATCHES] WAL logging freezing

2006-10-30 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 ISTM we only need to flush iff the clog would be truncated when we
 update relminxid.

Wrong :-(  If the relvacuumxid change (not relminxid ... as I said, these
names aren't very transparent) makes it to disk but not all the hint
bits do, you're at risk.  Crash, restart, vacuum some other table, and
*now* the global min vacuumxid advances.  The fact that we're
WAL-logging the relvacuumxid change makes this scenario exceedingly
probable, if no action is taken to force out the hint bits.

The only alternative I can see is the one Heikki suggested: don't
truncate clog until the freeze horizon.  That's safe (given the planned
change to WAL-log tuple freezing) and clean and simple, but a permanent
requirement of 250MB+ for pg_clog would put the final nail in the coffin
of PG's usability in small-disk-footprint environments.  So I don't like
it much.  I suppose it could be made more tolerable by reducing the
freeze horizon, say to 100M instead of 1G transactions.  Anyone for a
GUC parameter?  In a high-volume DB you'd want the larger setting to
minimize the amount of tuple freezing work.  OTOH it seems like making
this configurable creates a nasty risk for PITR situations: a slave
that's configured with a smaller freeze window than the master is
probably not safe.

regards, tom lane



Re: [HACKERS] [PATCHES] WAL logging freezing

2006-10-30 Thread Tom Lane
Alvaro Herrera [EMAIL PROTECTED] writes:
 In fact I don't understand the point about multiple databases vs.
 a single database.  Surely a checkpoint would flush all buffers in all
 databases, no?

Yeah --- all the ones that are dirty *now*.  Consider the case where you
vacuum DB X, update its datvacuumxid, and don't checkpoint because the
global min didn't advance.  Now you crash, possibly leaving some hint
bits unwritten; but the datvacuumxid change did make it to disk.  After
restart, vacuum DB Y, update its datvacuumxid, and find that the global
min *did* advance.  You checkpoint, and that guarantees that DB Y is
clean for the clog truncation.  But DB X isn't.

The 8.2 changes have created the equivalent risk at the level of each
individual table.  We can't write a vacuumxid change unless we are sure
that the hint-bit changes it promises are actually down to disk.
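
As a sketch, with every function name hypothetical (the real primitives
differ), the per-table ordering rule is:

    flush_relation_buffers(rel);            /* write out dirty heap pages */
    fsync_relation(rel);                    /* ...and force them to disk */
    record_new_relvacuumxid(rel, cutoff);   /* only now is this safe */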

regards, tom lane



Re: [HACKERS] [PATCHES] WAL logging freezing

2006-10-30 Thread Simon Riggs
On Mon, 2006-10-30 at 16:58 -0500, Tom Lane wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
  ISTM we only need to flush iff the clog would be truncated when we
  update relminxid.
 
 Wrong :-(  If the relvacuumxid change (not relminxid ... as I said, these
 names aren't very transparent) makes it to disk but not all the hint
 bits do, you're at risk.  Crash, restart, vacuum some other table, and
 *now* the global min vacuumxid advances.  The fact that we're
 WAL-logging the relvacuumxid change makes this scenario exceedingly
 probable, if no action is taken to force out the hint bits.

I don't agree: If the truncation points are at 1 million, 2 million etc,
then if we advance the relvacuumxid from 1.2 million to 1.5 million,
then crash, the hint bits for that last vacuum are lost. Sounds bad,
but we have not truncated clog, so there is no danger. In order to
truncate up to 2 million we need to re-vacuum; at that point we discover
that the 1.5 million setting was wrong, realise it should have been 1.2
million but don't care because we now set it to 1.8 million etc. No
problem, even with repeated crashes. We only flush when we move the
counter past a truncation point.

If you look at this another way, maybe you'll see what I'm saying: Only
update relvacuumxid iff the update would allow us to truncate the clog.
That way we leap forwards in 1 million Xid chunks, rounded down. No
change to clog = no update = no danger that we need to flush to avoid.
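
A sketch of that rule (helper names hypothetical; with 8.2-era constants
a clog segment covers 32 pages of 32768 transaction status entries,
i.e. 1,048,576 xids, which is where the 1-million truncation points come
from; xid wraparound is ignored for clarity):

/* TransactionId is PostgreSQL's uint32 xid type (c.h) */
#define XACTS_PER_CLOG_SEGMENT  (32 * 32768)    /* ~1 million xids */

/* Round a freeze cutoff down to the last truncation point it passed. */
static TransactionId
clog_truncation_point(TransactionId xid)
{
    return xid - (xid % XACTS_PER_CLOG_SEGMENT);
}

/* Persist relvacuumxid (flushing hint bits first) only on a crossing. */
static bool
crosses_truncation_point(TransactionId oldval, TransactionId newval)
{
    return clog_truncation_point(newval) != clog_truncation_point(oldval);
}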

 The only alternative I can see is the one Heikki suggested: don't
 truncate clog until the freeze horizon.  That's safe (given the planned
 change to WAL-log tuple freezing) and clean and simple, but a permanent
 requirement of 250MB+ for pg_clog would put the final nail in the coffin
 of PG's usability in small-disk-footprint environments.  So I don't like
 it much.  I suppose it could be made more tolerable by reducing the
 freeze horizon, say to 100M instead of 1G transactions.  Anyone for a
 GUC parameter?  In a high-volume DB you'd want the larger setting to
 minimize the amount of tuple freezing work.  OTOH it seems like making
 this configurable creates a nasty risk for PITR situations: a slave
 that's configured with a smaller freeze window than the master is
 probably not safe.

If we need to, just put the CLOG seg size in pg_config_manual.h

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com





Re: [HACKERS] [PATCHES] WAL logging freezing

2006-10-30 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 I don't agree: If the truncation points are at 1 million, 2 million etc,
 then if we advance the relvacuumxid from 1.2 million to 1.5 million,
 then crash, the hint bits for that last vacuum are lost. Sounds bad,
 but we have not truncated clog, so there is no danger.

You're still wrong though.  Suppose that VACUUM moves a particular rel's
relvacuumxid from 1.9 to 2.1 million, but because this rel is not
currently the oldest vacuumxid, it doesn't truncate clog.  Then we crash
and lose hint bits, but not the relvacuumxid change.  Then VACUUM
vacuums some other rel and advances its relvacuumxid from 1.9 to 2.1
million --- but this time that *was* the globally oldest value, and now
we think we can truncate clog at 2 million.  But the first rel might
still have some unhinted xids around 1.9 million.

 If you look at this another way, maybe you'll see what I'm saying: Only
 update relvacuumxid iff the update would allow us to truncate the clog.

Then you'll never update it at all, because there will always be some
other rel constraining the global min.

regards, tom lane



Re: [HACKERS] [PATCHES] WAL logging freezing

2006-10-30 Thread Simon Riggs
On Mon, 2006-10-30 at 19:18 -0500, Tom Lane wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
  I don't agree: If the truncation points are at 1 million, 2 million etc,
  then if we advance the relvacuumxid from 1.2 million to 1.5 million,
  then crash, the hint bits for that last vacuum are lost. Sounds bad,
  but we have not truncated clog, so there is no danger.
 
 You're still wrong though. 

Frequently, I'd say :-)

  Suppose that VACUUM moves a particular rel's
 relvacuumxid from 1.9 to 2.1 million, but because this rel is not
 currently the oldest vacuumxid, it doesn't truncate clog.  Then we crash
 and lose hint bits, but not the relvacuumxid change.  Then VACUUM
 vacuums some other rel and advances its relvacuumxid from 1.9 to 2.1
 million --- but this time that *was* the globally oldest value, and now
 we think we can truncate clog at 2 million.  But the first rel might
 still have some unhinted xids around 1.9 million.

That was understood; in the above example I agree you need to flush. If
you don't pass a truncation point, you don't need to flush whether or
not you actually truncate. So we don't need to flush *every* time, so
IMHO we don't need to play safe and keep clog the size of an iceberg.

Anyway, if PITR is safe again, I'd like to sleep... zzz

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com





Re: [HACKERS] [PATCHES] WAL logging freezing

2006-10-30 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 That was understood; in the above example I agree you need to flush. If
 you don't pass a truncation point, you don't need to flush whether or
 not you actually truncate. So we don't need to flush *every* time,

OK, but does that actually do much of anything for your performance
complaint?  Just after GlobalXmin has passed a truncation point, *every*
vacuum the system does will start performing a flush-n-fsync, which
seems like exactly what you didn't like.  If the syncs were spread out
in time for different rels then maybe this idea would help, but AFAICS
they won't be.

regards, tom lane



Re: [HACKERS] [PATCHES] WAL logging freezing

2006-10-27 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 [I've just coded the relcache invalidation WAL logging patch also.]

What?  That doesn't make any sense to me.

regards, tom lane



Re: [HACKERS] [PATCHES] WAL logging freezing

2006-10-27 Thread Simon Riggs
On Fri, 2006-10-27 at 12:01 -0400, Tom Lane wrote:
 Heikki Linnakangas [EMAIL PROTECTED] writes:
  Tom Lane wrote:
  I think it's premature to start writing
  patches until we've decided how this really needs to work.
 
  Not logging hint-bit updates seems safe to me. As long as we have the 
  clog, the hint-bit is just a hint. The problem with freezing is that 
  after freezing tuples, the corresponding clog page can go away.
 
 Actually clog can go away much sooner than that, at least in normal
 operation --- that's what datvacuumxid is for, to track where we can
 truncate clog.  

So we definitely have a nasty problem here.

VACUUM FREEZE is just a loaded gun right now.

 Maybe it's OK to say that during WAL replay we keep it
 all the way back to the freeze horizon, but I'm not sure how we keep the
 system from wiping clog it still needs right after switching to normal
 operation.  Maybe we should somehow not xlog updates of datvacuumxid?

Thinking...

Also, we should probably be setting all the hint bits for pages during
recovery then, so we don't need to re-write them later.

 Another thing I'm concerned about is the scenario where a PITR
 hot-standby machine tracks a master over a period of more than 4 billion
 transactions.  I'm not sure what will happen in the slave's pg_clog
 directory, but I'm afraid it won't be good :-(

I think we'll need to error-out at that point, plus produce messages
once recovery passes 2 billion transactions. It makes sense to take
a new base backup regularly anyway.

We'll also need to produce an error message on the primary server so
that we take a new base backup every 2 billion transactions.

There are better solutions, but I'm not sure it makes sense to pursue
them right now, since that could well delay the release. If we think
this is a necessary fix for the 8.2 line then we could get a better fix
into 8.2.1.

[I've just coded the relcache invalidation WAL logging patch also.]

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com





Re: [HACKERS] [PATCHES] WAL logging freezing

2006-10-27 Thread Simon Riggs
On Fri, 2006-10-27 at 22:19 +0100, Simon Riggs wrote:

 So we definitely have a nasty problem here.
 
 VACUUM FREEZE is just a loaded gun right now.
 
  Maybe it's OK to say that during WAL replay we keep it
  all the way back to the freeze horizon, but I'm not sure how we keep the
  system from wiping clog it still needs right after switching to normal
  operation.  Maybe we should somehow not xlog updates of datvacuumxid?
 
 Thinking...

Suggestions:

1. Create a new Utility rmgr that can issue XLOG_UTIL_FREEZE messages
for each block that has had any tuples frozen on it during normal
VACUUMs. We need to log only the relid, blockid and vacuum's xid to redo
the freeze operation (a possible record layout is sketched after this list).

2. VACUUM FREEZE need not generate any additional WAL records, but will
do an immediate sync following execution and before clog truncation.
That way the large number of changed blocks will all reach disk before
we do the updates to the catalog.

3. We don't truncate the clog during WAL replay, so the clog will grow
during recovery. Nothing to do there to make things safe.

4. When InArchiveRecovery we should set all of the datminxid and
datvacuumxid fields to be the Xid from where recovery started, so that
clog is not truncated soon after recovery. Performing a VACUUM FREEZE
after a recovery would be mentioned as an optional task at the end of a
PITR recovery on a failover/second server.

5. At 3.5 billion records during recovery we should halt the replay, do
a full database scan to set hint bits, truncate clog, then restart
replay. (Automatically within the recovery process).

6. During WAL replay, put out a warning message every 1 billion records
saying that a hint bit scan will eventually be required if recovery
continues.
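
A possible record layout for suggestion 1 (struct name and field layout
hypothetical; RelFileNode, BlockNumber and TransactionId are the
ordinary PostgreSQL types):

typedef struct xl_util_freeze
{
    RelFileNode node;           /* relation whose block was frozen */
    BlockNumber block;          /* heap block on which tuples were frozen */
    TransactionId cutoff_xid;   /* vacuum's freeze cutoff xid */
} xl_util_freeze;

Redo would revisit the block and refreeze every xid preceding
cutoff_xid, so the record stays fixed-size no matter how many tuples
were frozen.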

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com


