Re: [HACKERS] Help with finding checkpoint code

2002-08-31 Thread J. R. Nield

It is called by a special child process of the postmaster after a
signal. Search for PMSIGNAL_DO_CHECKPOINT in xlog.c and in postmaster.c.
The checkpoint process gets started out of sigusr1_handler().
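If it helps, the shape of the path is roughly the following (a simplified
sketch rather than the exact code; the fork is abbreviated and the
CreateCheckPoint argument list varies between versions, but
PMSIGNAL_DO_CHECKPOINT, sigusr1_handler, and CreateCheckPoint are the real
symbols to grep for):

    /* In xlog.c: a backend decides a checkpoint is due and pokes the
       postmaster (delivered as SIGUSR1). */
    SendPostmasterSignal(PMSIGNAL_DO_CHECKPOINT);

    /* In postmaster.c: sigusr1_handler() sees the flag and forks a
       short-lived child to do the work. */
    if (CheckPostmasterSignal(PMSIGNAL_DO_CHECKPOINT))
    {
        if (fork() == 0)            /* simplified: the real code goes through
                                       the normal child-startup machinery */
        {
            CreateCheckPoint(false, false); /* argument list abbreviated;
                                               MoveOfflineLogs() runs from
                                               inside this call */
            exit(0);
        }
    }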


On Sat, 2002-08-31 at 23:27, Bruce Momjian wrote:
 I am trying to find when WAL log files are rotated.  The message is:
 
   2002-02-11 21:18:13 DEBUG: recycled transaction log file 0005
 
 and it is printed in MoveOfflineLogs(), and MoveOfflineLogs() is only
 called by CreateCheckPoint(), but I can't see where CreateCheckPoint()
 is called in normal operation.  I see it called by CHECKPOINT, and on
 startup and shutdown, and from bootstrap, but where is it called during
 normal backend operation?
 
 -- 
   Bruce Momjian|  http://candle.pha.pa.us
   [EMAIL PROTECTED]   |  (610) 359-1001
   +  If your life is a hard drive, |  13 Roberts Road
   +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073
 
 
-- 
J. R. Nield
[EMAIL PROTECTED]







Re: [HACKERS] Database Caching

2002-08-25 Thread J. R. Nield

I'm not sure about query result caching or 'relation caching', since the
first would seem to run into problems with concurrent updates, and the
second is sort-of what the buffer cache does.

Query plan caching sounds like a really good idea though. Neil Conway's
PREPARE patch already does this for an individual backend. Do you think
it would be hard to make it use shared memory, and check if a query has
already been prepared by another backend? Maybe it could use something
like a whitespace-insensitive checksum for a shared hash key.
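Just to illustrate the kind of key I have in mind (a throwaway sketch, not a
patch; the FNV-1a hash and the exact normalization are arbitrary choices
here):

    #include <ctype.h>
    #include <stdint.h>

    /* Hash a query string so that runs of whitespace don't change the key.
       The full normalized text would still be kept in shared memory to
       check for hash collisions. */
    static uint32_t
    query_key(const char *sql)
    {
        uint32_t    h = 2166136261u;    /* FNV-1a offset basis */
        int         pending_space = 0;
        int         seen_token = 0;
        unsigned char c;

        for (; *sql; sql++)
        {
            c = (unsigned char) *sql;
            if (isspace(c))
            {
                pending_space = 1;      /* collapse the whole run */
                continue;
            }
            if (pending_space && seen_token)
                h = (h ^ ' ') * 16777619u;  /* one space stands in for the run */
            pending_space = 0;
            seen_token = 1;
            h = (h ^ c) * 16777619u;
        }
        return h;
    }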

Regards,

John Nield

On Sun, 2002-08-25 at 20:15, Bruce Momjian wrote:
 
 Do we want to add query caching to the TODO list, perhaps with a
 question mark?
 
 ---
 
 Greg Sabino Mullane wrote:
[snip]
 
-- 
J. R. Nield
[EMAIL PROTECTED]







Re: [HACKERS] PITR, checkpoint, and local relations

2002-08-08 Thread J. R. Nield

On Wed, 2002-08-07 at 23:41, Tom Lane wrote:
 J. R. Nield [EMAIL PROTECTED] writes:
  The xlog code must allow us to force an advance to the next log file,
  and truncate the archived file when it's copied so as not to waste
  space.
 
 Uh, why?  Why not just force a checkpoint and remember the exact
 location of the checkpoint within the current log file?

If I do a backup with PITR and save it to tape, I need to be able to
restore it even if my machine is destroyed in a fire, and all the logs
since the end of a backup are destroyed. If we don't allow the user to
force a log advance, how will he do this? I don't want to copy the log
file, and then have the original be written to later, because it will
become confusing as to which log file to use.

Is the complexity really that big of a problem with this?

 
 When and if you roll back to a prior checkpoint, you'd want to start the
 system running forward with a new xlog file, I think (compare what
 pg_resetxlog does).  But it doesn't follow that you MUST force an xlog
 file boundary simply because you're taking a backup.
 
  This complicates both the recovery logic and XLogInsert, and I'm trying
  to kill the last latent bug in that feature now.
 
 Indeed.  How about keeping it simple, instead?
 
   regards, tom lane
 
-- 
J. R. Nield
[EMAIL PROTECTED]







Re: [HACKERS] PITR, checkpoint, and local relations

2002-08-02 Thread J. R. Nield

OK. This is what I wanted to hear, but I had assumed someone decided to
put it in for a reason, and I wasn't going to submit a patch to pull out
the local buffer manager without clearing it first.

The main area where it seems to get heavy use is during index builds,
and for 'CREATE TABLE AS SELECT...'.

So I will remove the local buffer manager as part of the PITR patch,
unless there is further objection.

On Fri, 2002-08-02 at 00:49, Tom Lane wrote:
 J. R. Nield [EMAIL PROTECTED] writes:
  I am working on a way to do this with a signal, using holdoffs around
  calls into the storage-manager and VFS layers to prevent re-entrant
  calls. The local buffer manager is simple enough that it should be
  possible to flush them from within a signal handler at most times, but
  the VFS and storage manager are not safe to re-enter from a handler.
 
  Does this sound like a good idea?
 
 No.  What happened to simple?
 
 Before I'd accept anything like that, I'd rip out the local buffer
 manager and just do everything in the shared manager.  I've never
 seen any proof that the local manager buys any noticeable performance
 gain anyway ... how many people really do anything much with a table
 during its first transaction of existence?
 
   regards, tom lane
 
-- 
J. R. Nield
[EMAIL PROTECTED]







Re: [HACKERS] PITR, checkpoint, and local relations

2002-08-02 Thread J. R. Nield

On Fri, 2002-08-02 at 10:01, Tom Lane wrote:
 
 Just out of curiosity, though, what does it matter?  On re-reading your
 message I think you are dealing with a non problem, or at least the
 wrong problem.  Local relations do not need to be checkpointed, because
 by definition they were created by a transaction that hasn't committed
 yet.  They must be, and are, checkpointed to disk before the transaction
 commits; but up till that time, if you have a crash then the entire
 relation should just go away.

What happens when we have a local file that is created before the
backup, and it becomes global during the backup?

In order to copy this file, I either need:

1) A copy of all its blocks at the time backup started (or later), plus
all log records between then and the end of the backup.

OR

2) All the log records from the time the local file was created until
the end of the backup.

In the case of an idle uncommitted transaction that suddenly commits
during backup, case 2 might be very far back in the log file. In fact,
the log file might be archived to tape by then.

So I must do case 1, and checkpoint the local relations.


This brings up the question: why do I need to bother backing up files
that were local before the backup started, but became global during the
backup?

We already know that for the backup to be consistent after we restore
it, we must play the logs forward to the completion of the backup to
repair our fuzzy copies of the database files. Since the transaction
that makes the local-file into a global one has committed during our
backup, its log entries will be played forward as well.

What would happen if a transaction with a local relation commits during
backup, and there are log entries inserting the catalog tuples into
pg_class? Should I not apply those on restore? How do I know?

 
 That mechanism is there already --- perhaps it needs a few tweaks for
 PITR but I do not see any need for cross-backend flush commands for
 local relations.
 

This problem is subtle, and I'm maybe having difficulty explaining it
properly. Do you understand the issue I'm raising? Have I made some kind
of blunder, so that this is really not a problem? 

-- 
J. R. Nield
[EMAIL PROTECTED]







Re: [HACKERS] PITR, checkpoint, and local relations

2002-08-02 Thread J. R. Nield

On Fri, 2002-08-02 at 13:50, Richard Tucker wrote:
 pg_copy does not handle local relations as you would suspect.  To find the
 tables and indexes to back up, the backend processing the ALTER SYSTEM
 BACKUP statement reads the pg_class table.  Any tables in the process of
 coming into existence are of course not visible.  If somehow they were, then
 the backup would back up their contents.  Any in-private-memory changes
 would be captured during crash recovery on the copy of the database.  So the
 question is: is it possible to read the names of the local relations from
 the pg_class table even though their creation has not yet been committed?
 -regards
 richt
 
No, not really. At least not a consistent view.

The way to do this is to use the filesystem to discover the relfilenodes,
and there are a couple of ways to deal with the problem of files being
pulled out from under you, but you have to be careful about what the
buffer manager does when a file gets dropped.

The predicate for files we MUST (fuzzy) copy is: 
  File exists at start of backup AND File exists at end of backup

Any other file, while it may be copied, doesn't need to be in the backup
because either it will be created and rebuilt during play-forward
recovery, or it will be deleted during play-forward recovery, or both,
assuming those operations are logged. They really must be logged to do
what we want to do.
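As a minimal sketch of how a backup process might apply that predicate (the
FileSnapshot type and helpers here are invented purely for illustration; the
real code would fill the snapshots from a directory scan at the start and
end of the backup):

    #include <stdbool.h>
    #include <string.h>

    /* A directory snapshot: just the relation file names present at one
       instant. */
    typedef struct FileSnapshot
    {
        int         nfiles;
        char      **names;
    } FileSnapshot;

    static bool
    snapshot_contains(const FileSnapshot *snap, const char *name)
    {
        int     i;

        for (i = 0; i < snap->nfiles; i++)
            if (strcmp(snap->names[i], name) == 0)
                return true;
        return false;
    }

    /* A file MUST be (fuzzy-)copied iff it exists in both snapshots.
       Anything else is recreated or removed by play-forward recovery,
       provided file create/drop operations are logged. */
    static bool
    must_copy(const char *name,
              const FileSnapshot *at_start,
              const FileSnapshot *at_end)
    {
        return snapshot_contains(at_start, name) &&
               snapshot_contains(at_end, name);
    }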

Also, you can't use the normal relation_open stuff, because local
relations will not have a catalog entry, and it looks like there are
catcache/sinval issues that I haven't completely covered. So you've got
to do 'blind reads' through the buffer manager, which involves a minor
extension to the buffer manager to support this if local relations go
through the shared buffers, or coordinating with the local buffer
manager if they continue to work as they do now, which involves major
changes.

We also have to checkpoint at the start, and flush the log at the end.
-- 
J. R. Nield
[EMAIL PROTECTED]







Re: [HACKERS] PITR, checkpoint, and local relations

2002-08-02 Thread J. R. Nield

On Fri, 2002-08-02 at 16:01, Tom Lane wrote:
 J. R. Nield [EMAIL PROTECTED] writes:
  The predicate for files we MUST (fuzzy) copy is: 
    File exists at start of backup AND File exists at end of backup
 
 Right, which seems to me to negate all these claims about needing a
 (horribly messy) way to read uncommitted system catalog entries, do
 blind reads, etc.  What's wrong with just exec'ing tar after having
 done a checkpoint?
 
There is no need to read uncommitted system catalog entries. Just take a
snapshot of the directory to get the OIDs. You don't care whether they
get deleted before you get to them, because the log will take care of
that.

 (In particular, I *strongly* object to using the buffer manager at all
 for reading files for backup.  That's pretty much guaranteed to blow out
 buffer cache.  Use plain OS-level file reads.  An OS directory search
 will do fine for finding what you need to read, too.)

How do you get atomic block copies otherwise?

 
   regards, tom lane
 
-- 
J. R. Nield
[EMAIL PROTECTED]







Re: [HACKERS] PITR, checkpoint, and local relations

2002-08-02 Thread J. R. Nield

On Fri, 2002-08-02 at 16:59, Mikheev, Vadim wrote:

 You don't need it.
 As long as whole block is saved in log on first after
 checkpoint (you made before backup) change to block.

I thought half the point of PITR was to be able to turn off pre-image
logging so you can trade potential recovery time for speed without fear
of data-loss. Didn't we have this discussion before?

How is this any worse than a table scan?
 
-- 
J. R. Nield
[EMAIL PROTECTED]







Re: [HACKERS] PITR, checkpoint, and local relations

2002-08-02 Thread J. R. Nield

Are you sure this is true for all ports? And if so, why would it be
cheaper for the kernel to do it in its buffer manager, compared to us
doing it in ours? This just seems bogus to rely on. Does anyone know
what POSIX has to say about this? 

On Fri, 2002-08-02 at 18:01, Mikheev, Vadim wrote:
   How do you get atomic block copies otherwise?
  
  Eh?  The kernel does that for you, as long as you're reading the
  same-size blocks that the backends are writing, no?
 
 Good point.
 
 Vadim
 
-- 
J. R. Nield
[EMAIL PROTECTED]







Re: [HACKERS] PITR, checkpoint, and local relations

2002-08-01 Thread J. R. Nield

On Thu, 2002-08-01 at 17:14, Bruce Momjian wrote:
 
 J.R needs comments on this.  PITR has problems because local relations
 aren't logged to WAL.  Suggestions?
 
I'm sorry if it wasn't clear. The issue is not that local relations
aren't logged to WAL; they are. The issue is that you can't checkpoint
them. That means if you need a lower bound on the LSN to recover from,
then you either need to wait for all transactions using them to commit
and flush their local buffers, or there needs to be an async way to tell
them all to flush.

I am working on a way to do this with a signal, using holdoffs around
calls into the storage-manager and VFS layers to prevent re-entrant
calls. The local buffer manager is simple enough that it should be
possible to flush them from within a signal handler at most times, but
the VFS and storage manager are not safe to re-enter from a handler.

Does this sound like a good idea?

-- 
J. R. Nield
[EMAIL PROTECTED]







[HACKERS] PITR, checkpoint, and local relations

2002-07-24 Thread J. R. Nield

As per earlier discussion, I'm working on the hot backup issues as part
of the PITR support. While I was looking at the buffer manager and the
relcache/MyDb issues to figure out the best way to work this, it
occurred to me that PITR will introduce a big problem with the way we
handle local relations.

The basic problem is that local relations (rd_myxactonly == true) are
not part of a checkpoint, so there is no way to get a lower bound on the
starting LSN needed to recover a local relation. In the past this did
not matter, because either the local file would be (effectively)
discarded during recovery because it had not yet become visible, or the
file would be flushed before the transaction creating it made it
visible. Now this is a problem.

So I need a decision from the core team on what to do about the local
buffer manager. My preference would be to forget about the local buffer
manager entirely, or if not that then to allow it only for _true_
temporary data. The only alternative I can devise is to create some way
for all other backends to participate in a checkpoint, perhaps using a
signal. I'm not sure this can be done safely. 

Anyway, I'm glad the tuplesort stuff doesn't try to use relation files
:-)

Can the core team let me know if this is acceptable, and whether I
should move ahead with changes to the buffer manager (and some other
stuff) needed to avoid special treatment of rd_myxactonly relations?

Also to Richard: have you guys at Multera dealt with this issue already?
Is there some way around this that I'm missing?


Regards,

  John Nield




Just as an example of this problem, imagine the following sequence:

1) Transaction TX1 creates a local relation LR1 which will eventually
become a globally visible table. Tuples are inserted into the local
relation, and logged to the WAL file. Some tuples remain in the local
buffer cache and are not yet written out, although they are logged. TX1
is still in progress.

2) Backup starts, and checkpoint is called to get a minimum starting LSN
(MINLSN) for the backed-up files. Only the global buffers are flushed.

3) Backup process copies LR1 into the backup directory. (postulate some
way of coordinating with the local buffer manager, a problem I have not
solved).

4) TX1 commits and flushes its local buffers. A dirty buffer exists
whose LSN is before MINLSN. LR1 becomes globally visible.

5) Backup finishes copying all the files, including the local relations,
and then flushes the log. The log files between MINLSN and the current
LSN are copied to the backup directory, and backup is done.

6) Sometime later, a system administrator restores the backup and plays
the logs forward starting at MINLSN. LR1 will be corrupt, because some
of the log entries required for its restoration will be before MINLSN.
This corruption will not be detected until something goes wrong.

BTW: The problem doesn't only happen with backup! It occurs at every
checkpoint as well, I just missed it until I started working on the hot
backup issue.

-- 
J. R. Nield
[EMAIL PROTECTED]







Re: [HACKERS] Issues Outstanding for Point In Time Recovery (PITR)

2002-07-18 Thread J. R. Nield

Richard:

I can't quite follow this; maybe you sent a draft by accident. If you
want to post a patch against 7.2.1, or even better against HEAD in CVS,
that would be great. Or if you'd rather point me to your source online,
that would be good too.

I just want to clarify though: is this work released to the PostgreSQL
Development group by Progress and Multera, or do they still claim
copyright interest in it?

Regards,
J.R. Nield


On Thu, 2002-07-18 at 12:56, Richard Tucker wrote:
 
 
 -Original Message-
 From: J. R. Nield [mailto:[EMAIL PROTECTED]]
 Sent: Wednesday, July 17, 2002 8:13 PM
 To: [EMAIL PROTECTED]
 Cc: Bruce Momjian
 Subject: RE: [HACKERS] Issues Outstanding for Point In Time Recovery
 (PITR)
 
 
 On Wed, 2002-07-17 at 19:25, Richard Tucker wrote:
  Regarding hot backup.  Our implementation of pg_copy does a hot backup.
  It turns off database checkpointing for the duration of the backup.  It
  backs up all the files of the database cluster up to the wal file currently
  being logged to.  It then acquires the WalInsertLock lock long enough to
  back up the current wal file.
 
 Does it then allow more writes to that WAL file? It would seem like you
 want to advance the log to the next file, so the sysadmin wouldn't have
 to choose which one of log-file number 3 he wants to use at restore.
 
  Writes to the wal file are allowed during the backup, except for the
  backing up of the wal file that is current when the backup completes.
  That is, the pg_xlog directory is the last directory to be backed up.
  The wal files are backed up in the order they were used.  Continued wal
  file logging is allowed until the backup reaches the current wal file
  being written to.  To back up this last wal file the WalInsertLock is
  held until the copy of the wal file is complete.  So the backup stops
  update activity only long enough to copy this last 16MB file.
 
 Also, what do you mean by 'turns off checkpointing'. You have to do a
 checkpoint, or at least flush the buffers, when you start the backup.
 Otherwise how do you know what LSN to start from at restore?
 
  The pg_control file also gets backed up.  It contains the point in the
  log at which to begin the redo/roll forward.  Not allowing the redo point
  to advance while the backup goes on means that the startup process's crash
  recovery code will capture all the changes made to the database cluster
  while the backup was running.
 
 
 Anyway: Yes we'd love to see the code.
 
  In what form would you like me to send the code to you, e.g. as a patch,
  or a copy of our whole source ...
 
 Since I've pretty-much got create/drop and index stuff working, if your
 code does the rest then we should be good to go.
 
 ;jrnield
 
 
 --
 J. R. Nield
 [EMAIL PROTECTED]
 
 
 
 
-- 
J. R. Nield
[EMAIL PROTECTED]







Re: [HACKERS] Issues Outstanding for Point In Time Recovery (PITR)

2002-07-17 Thread J. R. Nield

On Wed, 2002-07-17 at 01:25, Bruce Momjian wrote:
 
 We only patch configure.in.  If you post to hackers, they can give you
 assistance and I will try to help however I can.  I can so some
 configure.in stuff for you myself.

Thanks for the offer. The only thing I was changing it for was to test
whether and how to get an Ethernet MAC address using ioctl, so libuuid
could use it if available. That is dropped now.

 
  Related to that, the other place I need advice is on adding Ted Tso's
  LGPL'd UUID library (stolen from e2fsprogs) to the source. Are we
  allowed to use this? There is a free OSF/DCE spec for UUID's, so I can
  re-implement the library if required.
 
 We talked about this on the replication mailing list.  We decided that
 hostname, properly hashed to an integer, was the proper way to get this
 value.  Also, there should be a postgresql.conf variable so you can
 override the hostname-generated value if you wish.  I think that is
 sufficient.

I will do something like this, but reserve 16 bytes for it just in case
we change our minds. It needs to be different among systems on the same
machine, so there needs to be a time value and a pseudo-random part as
well. Also, 'hostname' will likely be the same on many machines
(localhost.localdomain or similar).
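Something along these lines is what I have in mind for those 16 bytes
(purely an illustrative sketch; the field split, the djb2 hash, and the
helper names are placeholders, and random() would be seeded at startup):

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>

    /* 16 bytes: hashed hostname + creation time + a random tiebreaker for
       multiple clusters on the same host. */
    typedef struct SysIdentifier
    {
        uint32_t    hostname_hash;  /* overridable from postgresql.conf */
        uint32_t    created_sec;    /* when the cluster was initialized */
        uint64_t    random_bits;    /* distinguishes clusters on one host */
    } SysIdentifier;

    static uint32_t
    hash_string(const char *s)
    {
        uint32_t    h = 5381;       /* djb2; nothing fancy is required */

        while (*s)
            h = h * 33 + (unsigned char) *s++;
        return h;
    }

    static void
    make_sys_identifier(SysIdentifier *id)
    {
        char    host[256];

        if (gethostname(host, sizeof(host)) != 0)
            strcpy(host, "localhost");
        id->hostname_hash = hash_string(host);
        id->created_sec = (uint32_t) time(NULL);
        id->random_bits = ((uint64_t) random() << 32) | (uint64_t) random();
    }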

The only reason I bothered with UUID's before is because they have a
standard setup to make the possibility of collision extremely small, and
I figured replication will end up using it someday.

 
  We also haven't discussed commands for backup/restore, but I will use
  what I think is appropriate and we can change the grammar if needed. The
  initial hot-backup capability will require the database to be in
  read-only mode and use tar for backup, and I will add the ability to
  allow writes later.
 
 Yea, I saw Tom balked at that. I think we have enough manpower and time
 that we can get hot backup in normal read/write mode working before 7.3
 beta so I would just code it assuming the system is live and we can deal
 with making it hot-capable once it is in CVS.  It doesn't have to work
 100% until beta time.

Hot backup read/write requires that we force an advance in the logfile
segment after the backup. We need to save all the logs between backup
start and completion. Otherwise the files will be useless as a
standalone system if the current logs somehow get destroyed (fire in the
machine room, etc.).

The way I would do this is:

  create a checkpoint
  do the block-by-block walk of the files using the bufmgr
  create a second checkpoint
  force the log to advance past the end of the current segment
  save the log segments containing records between the
   first and second checkpoints with the backup

Then if you restore the backup, you can recover to the point of the
second checkpoint, even if the logs since then are all gone.

Right now the log segment size is fixed, so this means that we'd waste
8MB of log space on average to do a backup. Also, the way XLOG reads
records right now, we have to write placeholder records into the empty
space, because that's how it finds the end of the log stream. So I need
to change XLOG to handle skip records, and then to truncate the file
when it gets archived, so we don't have to save up to 16MB of zeros.

Also, if archiving is turned off, then we can't recycle or delete any
logs for the duration of the backup, and we have to save them.

So I'll finish the XLOG support for this, and then think about the
correct way to walk through all the files.
 
-- 
J. R. Nield
[EMAIL PROTECTED]







Re: [HACKERS] Issues Outstanding for Point In Time Recovery (PITR)

2002-07-16 Thread J. R. Nield

On Tue, 2002-07-16 at 15:36, Bruce Momjian wrote:
 
 J.R., just checking to see how PITR recovery is going.  Do you need any
 assistance or have any questions for us?
 
 Also, do you have any idea how close you are to having something
 completed?  Are you aware we are closing development of 7.3 at the end
 of August and start beta September 1?  Is there any way we can help you?
 

It should be ready to go into CVS by the end of the month. 

That will include: logging all operations except for rtree and GiST,
archival of logfiles (with new postgresql.conf params), headers on the
logfiles to verify the system that created them, standalone backend
recovery to a point-in-time, and a rudimentary hot backup capability.

I could use some advice on the proper way to add tests to configure.in,
given that the autoconf output is in CVS. Would you ever want a patch to
include the generated 'configure' file?

Related to that, the other place I need advice is on adding Ted Tso's
LGPL'd UUID library (stolen from e2fsprogs) to the source. Are we
allowed to use this? There is a free OSF/DCE spec for UUID's, so I can
re-implement the library if required.

We also haven't discussed commands for backup/restore, but I will use
what I think is appropriate and we can change the grammar if needed. The
initial hot-backup capability will require the database to be in
read-only mode and use tar for backup, and I will add the ability to
allow writes later.

Does this sound like a reasonable timeframe/feature-set to make the 7.3
release?

-- 
J. R. Nield
[EMAIL PROTECTED]







Re: [HACKERS] pgbench questions

2002-07-13 Thread J. R. Nield

On Sat, 2002-07-13 at 02:25, Tatsuo Ishii wrote: 
  (2) At least in the current CVS version, the code to do a 'CHECKPOINT'
  after creating a table has been #ifdef'ed out. Why is that?
 
  That is not after creation of a table, but while creating it, which is
  not necessary any more since Tom has fixed the growth of WAL logs.
 

Tatsuo:

Could you or Tom give me some background on what this change was about?
Is this something recent, or would it have been in CVS about a month
ago?

I can't see any reason to force a checkpoint after CREATE TABLE, but it
would be interesting to know why it was done before.

;John Nield

-- 
J. R. Nield
[EMAIL PROTECTED]







Re: [HACKERS] please help on query

2002-07-11 Thread J. R. Nield

On Thu, 2002-07-11 at 11:22, Luis Alberto Amigo Navarro wrote:
 I can't improve performance on this query:

Blame Canada!

-- 
J. R. Nield
[EMAIL PROTECTED]







[HACKERS] Issues Outstanding for Point In Time Recovery (PITR)

2002-07-04 Thread J. R. Nield
 corruption, and we restore it from backup, how do 
we know what archived files we need for recovery?

Should file corruption (partial write, bad disk block, etc.) outside 
the system catalog cause us to abort the system, or should we just 
take the relation or database off-line?

Given a backup file, how do we determine the point in the log 
where we should start recovery for the file? What is the highest LSN
we can use that will fully recover the file?

  §5.2 - Proposal: 

Put a file header on each file, and update that header to the last
checkpoint LSN at least once every 'file_lsn_time_slack' minutes, or
at least once every dbsize/'file_lsn_log_slack' megabytes of log
written, where dbsize is the estimated size of the database. Have
these values be settable from the config file. These updates would be
distributed throughout the hour, or interspersed between regular
amounts of log generation.
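To make the proposal concrete, the header might look something like this (a
hypothetical layout; XLogRecPtr is the existing log-position type, but the
struct and field names are invented for illustration):

    /* Hypothetical per-file header for the §5.2 proposal. */
    typedef struct FileLsnHeader
    {
        uint32      magic;          /* identifies this header format */
        uint32      blcksz;         /* BLCKSZ the file was written with */
        XLogRecPtr  lsnLowerBound;  /* known-good LSN lower bound: no log
                                       records older than this are needed
                                       to bring the file up to date */
    } FileLsnHeader;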

If we have a database backup program or command, it can update the
header on the file before backup to the greatest value it can assure
to be safe.

  §5.3 - Status:

Waiting for input from pgsql-hackers.

Questions:

There are alternative methods to using a file header for getting a
known-good LSN lower bound as the starting point to recover a backup
file. Is this the best way?

A) The Definitions

This stuff is obtuse, but I need it here to keep track of what I'm 
saying. Someday I should use it consistently in the rest of this
document.

system or database system:

A collection of postgres databases in one $PGDATA directory,
managed by one postmaster instance at a time (and having one WAL
log, etc.)

All the files composing such a system, as a group.

up to date or now or current or current LSN:

The most recent durable LSN for the system.

block consistent copy:

When referring to a file:

A copy of a file, which may be written to during the process of
copying, but where each BLCKSZ size block is copied atomically.

When referring to multiple files (in the same system):

A copy of all the files, such that each is independently a block
consistent copy

file consistent copy:

When referring to a file:

A copy of a file that is not written to between the start and end
of the copy operation.

When referring to multiple files (in the same system):

A copy of all the files, such that each is independently a file
consistent copy

system consistent copy:

When referring to a file:

A copy of a file, where the entire system of which it is a member
is not written to during the copy.

When referring to multiple files (in the same system):

A copy of all the files, where the entire system of which they are
members was not written to between the start and end of the
copying of all the files, as a group.

shutdown consistent copy:

When referring to a file:

A copy of a file, where the entire system of which it is a member
had been cleanly shutdown before the start of and for the duration
of the copy.

When referring to multiple files (in the same system):

A copy of all the files, where the entire system of which they are
members had been cleanly shutdown before the start of and for the
duration of the copying of all the files, as a group.

consistent copy:

A block, file, system, or shutdown consistent copy.

known-good LSN lower bound 
or LSN lower bound
or LSN-LB:

When referring to a group of blocks, a file, or a group of files:

An LSN known to be old enough that no log entries before it are needed
to bring the blocks or files up-to-date.

known-good LSN greatest lower bound 
or LSN greatest lower bound 
or LSN-GLB:

When referring to a group of blocks, a file, or a group of files:

The greatest possible LSN that is a known-good LSN lower bound for
the group.

backup file:

A consistent copy of a data file used by the system, for which 
we have a known-good LSN lower bound.

optimal backup file:

A backup file, for which we have the known-good LSN greatest lower
bound.

backup system:



Play-Forward File Recovery or PFFR:

The process of bringing an individual backup file up to date.


-- 
J. R. Nield
[EMAIL PROTECTED]











Re: [HACKERS] Issues Outstanding for Point In Time Recovery (PITR)

2002-07-04 Thread J. R. Nield

On Thu, 2002-07-04 at 11:45, J. R. Nield wrote:

One other item that should be here:
 The big items so-far are:
 §1 - Logging Relation file creation, truncation, and removal 
 This is mostly done. Can do infinite play-forward from
 online logs.
 §2 - Partial-Write and Bad Block detection
 Need input before starting. Migration issues.
 §3 - Detecting Shutdown Consistent System Recovery
 Mostly done.
 §4 - Interactive Play-Forward Recovery for an Entire System
 Need input before starting.
 §5 - Individual file consistent recovery
 Need input. Semi-Major changes required.
  §6 - btbuild is not logged
  Not logged because of same assumptions as for file create.
  Only need to log the index build parameters to recreate
  the index, not each page change.


-- 
J. R. Nield
[EMAIL PROTECTED]









Re: [HACKERS] Vacuum Daemon

2002-06-29 Thread J. R. Nield

On Sat, 2002-06-29 at 20:14, Tom Lane wrote:
 Matthew T. O'Connor [EMAIL PROTECTED] writes:

  Second: There was some discussion 
  (http://archives.postgresql.org/pgsql-hackers/2002-05/msg00970.php) about 
  this not being needed once UNDO is in place; what is the current view on this?
 
 I do not think that is the case; and anyway we've pretty much rejected
 Vadim's notion of going to an Oracle-style UNDO buffer.  I don't foresee
 VACUUM going away anytime soon --- what we need is to make it less
 obtrusive.  7.2 made some progress in that direction, but we need more.
 

Could someone point me to this discussion, or summarize what the problem
was? Was his proposal to keep tuple versions in the UNDO AM, or only
pointers to them?

The referred-to message seems to be about something else.

;jrnield
 
-- 
J. R. Nield
[EMAIL PROTECTED]











Re: [HACKERS] Vacuum Daemon

2002-06-29 Thread J. R. Nield

On Sat, 2002-06-29 at 21:55, Tom Lane wrote:
 J. R. Nield [EMAIL PROTECTED] writes:
  I do not think that is the case; and anyway we've pretty much rejected
  Vadim's notion of going to an Oracle-style UNDO buffer.
 
  Could someone point me to this discussion, or summarize what the problem
  was?
 
 I'm too lazy to dig through the archives at the moment, but the main
 points were (a) a finite-size UNDO buffer chokes on large transactions
 

Yes this is a good point. Oracle was always lame with its ROLLBACK
SEGMENTS. SolidDB (SolidWorks? It's been a while...) used a btree-like
structure for this that was not of fixed size. Oracle supposedly moved
to the same method in its 9i release, but I don't know the details.

I could never figure out how they did this, until I realized that UNDO
doesn't need to be in the WAL log. You just use any access method you
feel like, and make sure the method is itself protected by REDO. Just
insert REDO entries to protect the insert into the UNDO AM, and REDO log
when you delete. That makes it easy to have the recovery code be
idempotent, to catch the case of a system crash during recovery.

 and (b) the Oracle approach requires live transaction processing to
 do the cleanup work that our approach can push off to hopefully-not-
 time-critical vacuum processing.

I'm not sure which way I'm leaning on this. On the one hand, it requires
extra work to clean up while the system is live, in addition to writing
the undo records, though the cleanup is not necessarily by the same
transaction that committed the work (the cleanup needs to be deferred
until it's out of an active snapshot anyway).

On the other hand, you can clean-up without a full table scan, because
you know which tuples need to be changed. This can be a big advantage on
gigantic tables. Also, it lets you remove deleted tuples quickly, so the
space can be reused, and eliminates the xid wraparound problem.

Of course, any kind of undo is worse for performance with bulk
inserts/updates, so you either end up committing every few thousand
inserts, or you use some special extension to disable undo logging for a
bulk load (or if you really want to be able to roll it back, you live
with it :-)

How slow is it to vacuum a 1 TB database with postgres? Do we even have
any users who could test this?

Also, I would never advocate that we do what I'm pretty sure Oracle
does, and keep old values in the Rollback Segment. Only (RelFileNode,
ItemPointerData) addresses would need to be kept in the UNDO AM, if we
went this route.
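To be concrete about how small such entries could be, the UNDO AM might
store something like the following (a hypothetical layout; TransactionId,
RelFileNode, and ItemPointerData are existing types, but the struct itself
is invented for illustration):

    typedef struct UndoTupleRef
    {
        TransactionId   xid;        /* transaction whose change this undoes */
        RelFileNode     relnode;    /* which relation file */
        ItemPointerData tid;        /* which tuple within that file */
        uint8           action;     /* insert vs. delete, so cleanup knows
                                       whether to reclaim or to unhide */
    } UndoTupleRef;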

 
 UNDO per se doesn't eliminate VACUUM anyhow; it only reclaims space
 from tuples written by aborted transactions.  If you want to get rid
 of VACUUM then you need another way to get rid of the old versions of
 perfectly good committed tuples that are obsoleted by updates from
 later transactions.  That essentially means you need an overwriting
 storage manager, which is a concept that doesn't mix well with MVCC.

Well, you can keep the UNDO records after commit to do a fast
incremental vacuum as soon as the transaction that deleted the tuples
becomes older than the oldest snapshot. If this is always done whenever
an XID becomes that old, then you never need to vacuum, and you never
need a full table scan.

Because postgres never overwrites (except at vacuum), I think it
actually makes us a BETTER candidate for this to be implemented cleanly
then with an overwriting storage manager. We will never need to keep
tuple values in UNDO!

 
 Oracle found a solution to that conundrum, but it's really not obvious
 to me that their solution is better than ours.

Their approach was worse, because they had an overwriting storage
manager before they tried to implement it (I'm guessing). :-)

 Also, they have
 patents that we'd probably run afoul of if we try to imitate their
 approach too closely.
 

Given the current state of affairs here in the US, PostgreSQL probably
violates hundreds or even thousands of software patents. It probably
violates tens of patents that have been upheld in court. The only thing
keeping companies from shutting down postgres, linux, OpenOffice, and a
hundred other projects is fear of adverse publicity, and the fact that
development would move overseas and continue to be a thorn in their
side. 

We'll see how long this lasts, given the fear some vendors have of
certain maturing open-source/GPL projects, but I don't think PostgreSQL
will be first, since anyone can take this code and become an instant
proprietary database vendor! (No, I'm not complaining. Please, nobody
start a license fight because of this)

-- 
J. R. Nield
[EMAIL PROTECTED]











Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE

2002-06-24 Thread J. R. Nield

On Sun, 2002-06-23 at 21:29, J. R. Nield wrote:

 If is impossible to do what you want. You can not protect against...
Wow. The number of typo's in that last one was just amazing. I even
started with one.

Have an nice weekend everybody :-)

;jrnield

-- 
J. R. Nield
[EMAIL PROTECTED]











Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE

2002-06-24 Thread J. R. Nield

On Sun, 2002-06-23 at 15:36, Bruce Momjian wrote:
 Yes, I don't see writing to two files vs. one to be any win, especially
 when we need to fsync both of them.  What I would really like is to
 avoid the double I/O of writing to WAL and to the data file;  improving
 that would be a huge win.
 

It is impossible to do what you want. You cannot protect against
partial writes without writing pages twice and calling fdatasync between
them while going through a generic filesystem. The best disk array will
not protect you if the operating system does not align block writes to
the structure of the underlying device. Even with raw devices, you need
special support or knowledge of the operating system and/or the disk
device to ensure that each write request will be atomic to the
underlying hardware. 

All other systems rely on the fact that you can recover a damaged file
using the log archive. This means downtime in the rare case, but no data
loss. Until PostgreSQL can do this, then it will not be acceptable for
real critical production use. This is not to knock PostgreSQL, because
it is a very good database system, and clearly the best open-source one.
It even has feature advantages over the commercial systems. But at the
end of the day, unless you have complete understanding of the I/O system
from write(2) through to the disk system, the only sure ways to protect
against partial writes are by careful writes (in the WAL log or
elsewhere, writing pages twice), or by requiring (and allowing) users to
do log-replay recovery when a file is corrupted by a partial write. As
long as there is a UPS and the operating system doesn't crash, there
should still be no partial writes.

If we log pages to WAL, they are useless when archived (after a
checkpoint). So either we have a separate log for them (the ping-pong
file), or we should at least remove them when archived, which makes log
archiving more complex but is perfectly doable.

Finally, I would love to hear why we are using the operating system
buffer manager at all. The OS is acting as a secondary buffer manager
for us. Why is that? What flaw in our I/O system does this reveal? I
know that:

We sync only WAL, not the other pages, except for the sync() call we do
 during checkpoint when we discard old WAL files.

But this is probably not a good thing. We should only be writing blocks
when they need to be on disk. We should not be expecting the OS to write
them sometime later and avoid blocking (as long) for the write. If we
need that, then our buffer management is wrong and we need to fix it.
The reason we are doing this is because we expect the OS buffer manager
to do asynchronous I/O for us, but then we don't control the order. That
is the reason why we have to call fdatasync(), to create sequence
points.

The reason we have performance problems with either O_DSYNC or fdatasync
on the normal relations is that we have no dbflush process. This
causes an unacceptable amount of I/O blocking by other transactions.

The ORACLE people were not kidding when they said that they could not
certify Linux for production use until it supported O_DSYNC. Can you
explain why that was the case?

Finally, let me apologize if the above comes across as somewhat
belligerent. I know very well that I can't compete with you guys for
knowledge of the PostgreSQL system. I am still at a loss when I look at
the optimizer and executor modules, and it will take some time before I
can follow discussion of that area. Even then, I doubt my ability to
compare with people like Mr. Lane and Mr. Momjian in experience and
general intelligence, or in the field of database programming and
software development in particular. However, this discussion and a
search of the pgsql-hackers archives reveals this problem to be the KEY
area of PostgreSQL's failing, and general misunderstanding, when
compared to its commercial competitors.

Sincerely, 

J. R. Nield

-- 
J. R. Nield
[EMAIL PROTECTED]











Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE

2002-06-24 Thread J. R. Nield

On Sun, 2002-06-23 at 23:40, Curt Sampson wrote:
 On 23 Jun 2002, J. R. Nield wrote:
 
  If is impossible to do what you want. You can not protect against
  partial writes without writing pages twice and calling fdatasync
  between them while going through a generic filesystem.
 
 I agree with this.
 
  The best disk array will not protect you if the operating system does
  not align block writes to the structure of the underlying device.
 
 This I don't quite understand. Assuming you're using a SCSI drive
 (and this mostly applies to ATAPI/IDE, too), you can do naught but
 align block writes to the structure of the underlying device. When you
 initiate a SCSI WRITE command, you start by telling the device at which
 block to start writing and how many blocks you intend to write. Then you
 start passing the data.
 

All I'm saying is that the entire postgresql block write must be
converted into exactly one SCSI write command in all cases, and I don't
know a portable way to ensure this. 

  Even with raw devices, you need special support or knowledge of the
  operating system and/or the disk device to ensure that each write
  request will be atomic to the underlying hardware.
 
 Well, so here I guess you're talking about two things:
 
 1. When you request, say, an 8K block write, will the OS really
 write it to disk in a single 8K or multiple of 8K SCSI write
 command?
 
 2. Does the SCSI device you're writing to consider these writes to
 be transactional. That is, if the write is interrupted before being
 completed, does the SCSI device guarantee that the partially-sent
 data is not written, and the old data is maintained? And of course,
 does it guarantee that, when it acknowledges a write, that write is
 now in stable storage and will never go away?
 
 Both of these are not hard to guarantee, actually. For a BSD-based OS,
 for example, just make sure that your filesystem block size is the
 same as or a multiple of the database block size. BSD will never write
 anything other than a block or a sequence of blocks to a disk in a
 single SCSI transaction (unless you've got a really odd SCSI driver).
 And for your disk, buy a Baydel or Clarion disk array, or something
 similar.
 
 Given that it's not hard to set up a system that meets these criteria,
 and this is in fact commonly done for database servers, it would seem a
 good idea for postgres to have the option to take advantage of the time
 and money spent and adjust its performance upward appropriately.

I agree with this. My point was only that you need to know what
guarantees your operating system/hardware combination provides on a
case-by-case basis, and there is no standard way for a program to
discover this. Most system administrators are not going to know this
either, unless databases are their main responsibility.

 
  All other systems rely on the fact that you can recover a damaged file
  using the log archive.
 
 Not exactly. For MS SQL Server, at any rate, if it detects a page tear
 you cannot restore based on the log file alone. You need a full or
 partial backup that includes that entire torn block.
 

I should have been more specific: you need a backup of the file from
some time ago, plus all the archived logs from then until the current
log sequence number.

  This means downtime in the rare case, but no data loss. Until
  PostgreSQL can do this, then it will not be acceptable for real
  critical production use.
 
 It seems to me that it is doing this right now. In fact, it's more
 reliable than some commerial systems (such as SQL Server) because it can
 recover from a torn block with just the logfile.

Again, what I meant to say is that the commercial systems can recover
with an old file backup + logs. How old the backup can be depends only
on how much time you are willing to spend playing the logs forward. So
if you do a full backup once a week, and multiplex and backup the logs,
then even if a backup tape gets destroyed you can still survive. It just
takes longer.

Also, PostgreSQL can't recover from any other type of block corruption,
while the commercial systems can. That's what I meant by the critical
production use comment, which was sort of unfair.

So I would say they are equally reliable for torn pages (but not bad
blocks), and the commercial systems let you trade potential recovery
time for not having to write the blocks twice. You do need to back-up
the log archives though.

 
  But at the end of the day, unless you have complete understanding of
  the I/O system from write(2) through to the disk system, the only sure
  ways to protect against partial writes are by careful writes (in
  the WAL log or elsewhere, writing pages twice), or by requiring (and
  allowing) users to do log-replay recovery when a file is corrupted by
  a partial write.
 
 I don't understand how, without a copy of the old data that was in the
 torn block, you can restore that block from just log file entries. Can
 you

Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE

2002-06-24 Thread J. R. Nield

On Mon, 2002-06-24 at 17:16, Tom Lane wrote:
 
 I think you have been missing the point...  
Yes, this appears to be the case. Thanks especially to Curt for clearing
things up for me.

-- 
J. R. Nield
[EMAIL PROTECTED]











Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE

2002-06-23 Thread J. R. Nield

On Sat, 2002-06-22 at 19:17, Bruce Momjian wrote:
 J. R. Nield wrote:
  One other point:
  
  Page pre-image logging is fundamentally the same as what Jim Gray's
  book[1] would call careful writes. I don't believe they should be in
  the XLOG, because we never need to keep the pre-images after we're sure
  the buffer has made it to the disk. Instead, we should have the buffer
  IO routines implement ping-pong writes of some kind if we want
  protection from partial writes.
 
 Ping-pong writes to where?  We have to fsync, and rather than fsync that
 area and WAL, we just do WAL.  Not sure about a win there.
 

The key question is: do we have some method to ensure that the OS
doesn't do the writes in parallel?

If the OS will ensure that one of the two block writes of a ping-pong
completes before the other starts, then we don't need to fsync() at 
all. 

The only thing we are protecting against is the possibility of both
writes being partial. If neither is done, that's fine because WAL will
protect us. If the first write is partial, we will detect that and use
the old data from the other, then recover from WAL. If the first is
complete but the second is partial, then we detect that and use the
newer block from the first write. If the second is complete but the
first is partial, we detect that and use the newer block from the second
write.

So does anyone know a way to prevent parallel writes in one of the
common unix standards? Do they say anything about this?

It would seem to me that if the same process does both ping-pong writes,
then there should be a cheap way to enforce a serial order. I could be
wrong though.

As to where the first block of the ping-pong should go, maybe we could
reserve a file with nBlocks space for them, and write the information
about which block was being written to the XLOG for use in recovery.
There are many other ways to do it.
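To make the ordering explicit, a careful write of one buffer could look
roughly like this (a sketch in plain POSIX calls; the side file and its
layout are placeholders, and error handling is minimal):

    #include <sys/types.h>
    #include <unistd.h>

    #define BLCKSZ 8192

    /* Careful ("ping-pong") write of one page: put the image in a side
       location first, force it to disk, then write the home location.
       At most one of the two copies can ever be torn, so recovery can
       pick the intact one (e.g. by checksum) and replay WAL from there. */
    static int
    careful_write(int side_fd, off_t side_off,
                  int rel_fd, off_t rel_off,
                  const char *buf)
    {
        if (pwrite(side_fd, buf, BLCKSZ, side_off) != BLCKSZ)
            return -1;
        if (fdatasync(side_fd) != 0)    /* the ordering point: the side copy
                                           is durable before the home block
                                           is touched */
            return -1;
        if (pwrite(rel_fd, buf, BLCKSZ, rel_off) != BLCKSZ)
            return -1;
        return 0;
    }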

;jrnield

-- 
J. R. Nield
[EMAIL PROTECTED]







Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE

2002-06-23 Thread J. R. Nield

On Sun, 2002-06-23 at 11:19, Tom Lane wrote: 
 Curt Sampson [EMAIL PROTECTED] writes:
  This should also allow us to disable completely the ping-pong writes
  if we have a disk subsystem that we trust.
 
 If we have a disk subsystem we trust, we just disable fsync on the
 WAL and the performance issue largely goes away.

It wouldn't work because the OS buffering interferes, and we need those
WAL records on disk up to the greatest LSN of the Buffer we will be writing.


We already buffer WAL ourselves. We also already buffer regular pages.
Whenever we write a Buffer out of the buffer cache, it is because we
really want that page on disk and wanted to start an IO. If that's not
the case, then we should have more block buffers!

So since we have all this buffering designed especially to meet our
needs, and since the OS buffering is in the way, can someone explain to
me why postgresql would ever open a file without the O_DSYNC flag if the
platform supports it? 
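For clarity, that would just mean opening the relation segments roughly like
this (a sketch; whether O_DSYNC exists, and what it costs, is exactly the
platform question being raised here):

    #include <fcntl.h>

    #ifndef O_DSYNC
    #define O_DSYNC 0               /* platform lacks it: fall back to plain
                                       writes plus an explicit fdatasync() */
    #endif

    static int
    open_relation_segment(const char *path)
    {
        return open(path, O_RDWR | O_DSYNC, 0600);
    }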



 
 I concur with Bruce: the reason we keep page images in WAL is to
 minimize the number of places we have to fsync, and thus the amount of
 head movement required for a commit.  Putting the page images elsewhere
 cannot be a win AFAICS.


Why not put all the page images in a single pre-allocated file and treat
it as a ring? How could this be any worse than flushing them in the WAL
log? 

Maybe fsync would be slower with two files, but I don't see how
fdatasync would be, and most platforms support that. 

What would improve performance would be to have a dbflush process that
would work in the background flushing buffers in groups and trying to
stay ahead of ReadBuffer requests. That would let you do the temporary
side of the ping-pong as a huge O_DSYNC writev(2) request (or
fdatasync() once) and then write out the other buffers. It would also
tend to prevent the other backends from blocking on write requests. 

A dbflush could also support aio_read/aio_write on platforms like
Solaris and WindowsNT that support it. 

Am I correct that right now, buffers only get written when they get
removed from the free list for reuse? So a released dirty buffer will
sit in the buffer free list until it becomes the Least Recently Used
buffer, and will then cause a backend to block for IO in a call to
BufferAlloc? 

This would explain why we like using the OS buffer cache, and why our
performance is troublesome when we have to do synchronous IO writes, and
why fsync() takes so long to complete. All of the backends block for
each call to BufferAlloc() after a large table update by a single
backend, and then the OS buffers are always full of our written data. 

Am I reading the bufmgr code correctly? I already found an imaginary
race condition there once :-) 

;jnield 


 
  Well, whether or not there's a cheap way depends on whether you consider
  fsync to be cheap. :-)
 
 It's never cheap :-(
 
-- 
J. R. Nield
[EMAIL PROTECTED]





Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE

2002-06-23 Thread J. R. Nield

On Sun, 2002-06-23 at 12:10, Curt Sampson wrote:
 
 So what we would really need to implement, if we wanted to be more
 efficient with trusted disk subsystems, would be the option of writing
 to the log only the changed row or changed part of the row, or writing
 the entire changed page. I don't know how hard this would be
 
We already log that stuff. The page images are in addition to the
Logical Changes, so we could just stop logging the page images.

-- 
J. R. Nield
[EMAIL PROTECTED]







Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE

2002-06-22 Thread J. R. Nield

On Thu, 2002-06-20 at 21:58, Bruce Momjian wrote:
 I was wondering, how does knowing the block is corrupt help MS SQL? 
 Right now, we write changed pages to WAL, then later write them to disk.
 I have always been looking for a way to prevent these WAL writes.  The
 512-byte bit seems interesting, but how does it help?
 
 And how does the bit help them with partial block writes?  Is the bit at
 the end of the block?  Is that reliable?
 

My understanding of this is as follows:

1) On most commercial systems, if you get a corrupted block (from
partial write or whatever) you need to restore the file(s) from the most
recent backup, and replay the log from the log archive (usually only the
damaged files will be written to during replay). 

2) If you can't deal with the downtime to recover the file, then EMC,
Sun, or IBM will sell you an expensive disk array with an NVRAM cache
that will do atomic writes. Some plain-vanilla SCSI disks are also
capable of atomic writes, though usually they don't use NVRAM to do it. 

The database must then make sure that each page-write gets translated
into exactly one SCSI-level write. This is one reason why ORACLE and
Sybase recommend that you use raw disk partitions for high availability.
Some operating systems support this through the filesystem, but it is OS
dependent. I think Solaris 7 and 8 have support for this, but I'm not sure.

PostgreSQL has trouble because it can neither archive logs for replay,
nor use raw disk partitions.


One other point:

Page pre-image logging is fundamentally the same as what Jim Gray's
book[1] would call careful writes. I don't believe they should be in
the XLOG, because we never need to keep the pre-images after we're sure
the buffer has made it to the disk. Instead, we should have the buffer
IO routines implement ping-pong writes of some kind if we want
protection from partial writes.


Does any of this make sense?



;jrnield


[1] Gray, J. and Reuter, A. (1993). Transaction Processing: Concepts
and Techniques. Morgan Kaufmann.

-- 
J. R. Nield
[EMAIL PROTECTED]



