Re: [HACKERS] Help with finding checkpoint code
It is called by a special child process of the postmaster after a signal. Search for PMSIGNAL_DO_CHECKPOINT in xlog.c and in postmaster.c. The checkpoint process gets started out of sigusr1_handler().

On Sat, 2002-08-31 at 23:27, Bruce Momjian wrote:
> I am trying to find when WAL log files are rotated. The message is:
>
>     2002-02-11 21:18:13 DEBUG: recycled transaction log file 0005
>
> and it is printed in MoveOfflineLogs(), and MoveOfflineLogs() is only
> called by CreateCheckPoint(), but I can't see where CreateCheckPoint()
> is called in normal operation. I see it called by CHECKPOINT, and on
> startup and shutdown, and from bootstrap, but where is it called during
> normal backend operation?
>
> --
>   Bruce Momjian                      |  http://candle.pha.pa.us
>   [EMAIL PROTECTED]                  |  (610) 359-1001
>   +  If your life is a hard drive,   |  13 Roberts Road
>   +  Christ can be your backup.      |  Newtown Square, Pennsylvania 19073

--
J. R. Nield
[EMAIL PROTECTED]
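For orientation, a minimal sketch of the trigger path the message describes. PMSIGNAL_DO_CHECKPOINT, sigusr1_handler(), and the pmsignal calls are from the sources named above; XLOGSegmentsConsumed() and StartCheckpointChild() are hypothetical stand-ins, so treat this as an illustration rather than the actual 7.2-era code.

    /* In xlog.c (sketch): after consuming enough log segments, a backend
     * asks the postmaster for a checkpoint rather than running one itself.
     * XLOGSegmentsConsumed() is a hypothetical stand-in. */
    if (XLOGSegmentsConsumed() >= CheckPointSegments)
        SendPostmasterSignal(PMSIGNAL_DO_CHECKPOINT);

    /* In postmaster.c (sketch): the SIGUSR1 handler notices the request
     * and forks a special child to run CreateCheckPoint().
     * StartCheckpointChild() is a hypothetical stand-in. */
    static void
    sigusr1_handler(int postgres_signal_arg)
    {
        if (CheckPostmasterSignal(PMSIGNAL_DO_CHECKPOINT))
            CheckPointPID = StartCheckpointChild();
    }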
Re: [HACKERS] Database Caching
I'm not sure about query result caching or 'relation caching', since the first would seem to run into problems with concurrent updates, and the second is sort-of what the buffer cache does.

Query plan caching sounds like a really good idea, though. Neil Conway's PREPARE patch already does this for an individual backend. Do you think it would be hard to make it use shared memory, and check if a query has already been prepared by another backend? Maybe it could use something like a whitespace-insensitive checksum for a shared hash key.

Regards,
John Nield

On Sun, 2002-08-25 at 20:15, Bruce Momjian wrote:
> Do we want to add query caching to the TODO list, perhaps with a
> question mark?
>
> ---
>
> Greg Sabino Mullane wrote:
> [snip]

--
J. R. Nield
[EMAIL PROTECTED]
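A minimal sketch of the whitespace-insensitive checksum floated above, usable as a shared hash key on query strings. The FNV-1a constants are standard; the function name and the collapse-runs-to-one-space policy are illustrative assumptions, not code from any patch.

    #include <ctype.h>
    #include <stdint.h>

    static uint32_t
    query_checksum(const char *query)
    {
        uint32_t hash = 2166136261u;   /* FNV-1a 32-bit offset basis */
        int      pending_space = 0;    /* saw whitespace since last byte */
        int      started = 0;          /* hashed at least one byte yet   */

        for (; *query; query++)
        {
            if (isspace((unsigned char) *query))
            {
                pending_space = 1;
                continue;
            }
            if (pending_space && started)
            {
                /* A run of whitespace hashes as one space, so "a  b",
                 * "a b" and "a\nb" collide, but "ab" does not. */
                hash = (hash ^ ' ') * 16777619u;
            }
            pending_space = 0;
            started = 1;
            hash = (hash ^ (unsigned char) *query) * 16777619u;
        }
        return hash;
    }

Collapsing whitespace runs rather than deleting them keeps 'SELECT 1' and 'SELECT1' distinct while still matching queries that differ only in formatting; leading and trailing whitespace is ignored entirely.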
Re: [HACKERS] PITR, checkpoint, and local relations
On Wed, 2002-08-07 at 23:41, Tom Lane wrote:
> J. R. Nield <[EMAIL PROTECTED]> writes:
> > The xlog code must allow us to force an advance to the next log file,
> > and truncate the archived file when it's copied so as not to waste
> > space.
>
> Uh, why? Why not just force a checkpoint and remember the exact
> location of the checkpoint within the current log file?

If I do a backup with PITR and save it to tape, I need to be able to restore it even if my machine is destroyed in a fire, and all the logs since the end of the backup are destroyed. If we don't allow the user to force a log advance, how will he do this? I don't want to copy the log file and then have the original be written to later, because it will become confusing as to which log file to use. Is the complexity really that big of a problem with this?

> When and if you roll back to a prior checkpoint, you'd want to start
> the system running forward with a new xlog file, I think (compare what
> pg_resetxlog does). But it doesn't follow that you MUST force an xlog
> file boundary simply because you're taking a backup.
>
> > This complicates both the recovery logic and XLogInsert, and I'm
> > trying to kill the last latent bug in that feature now.
>
> Indeed. How about keeping it simple, instead?
>
> regards, tom lane

--
J. R. Nield
[EMAIL PROTECTED]
Re: [HACKERS] PITR, checkpoint, and local relations
Ok. This is what I wanted to hear, but I had assumed someone decided to put it in for a reason, and I wasn't going to submit a patch to pull out the local buffer manager without clearing it first. The main area where it seems to get heavy use is during index builds, and for 'CREATE TABLE AS SELECT ...'. So I will remove the local buffer manager as part of the PITR patch, unless there is further objection.

On Fri, 2002-08-02 at 00:49, Tom Lane wrote:
> J. R. Nield <[EMAIL PROTECTED]> writes:
> > I am working on a way to do this with a signal, using holdoffs around
> > calls into the storage-manager and VFS layers to prevent re-entrant
> > calls. The local buffer manager is simple enough that it should be
> > possible to flush them from within a signal handler at most times,
> > but the VFS and storage manager are not safe to re-enter from a
> > handler. Does this sound like a good idea?
>
> No. What happened to simple? Before I'd accept anything like that, I'd
> rip out the local buffer manager and just do everything in the shared
> manager. I've never seen any proof that the local manager buys any
> noticeable performance gain anyway ... how many people really do
> anything much with a table during its first transaction of existence?
>
> regards, tom lane

--
J. R. Nield
[EMAIL PROTECTED]
Re: [HACKERS] PITR, checkpoint, and local relations
On Fri, 2002-08-02 at 10:01, Tom Lane wrote:
> Just out of curiosity, though, what does it matter? On re-reading your
> message I think you are dealing with a non-problem, or at least the
> wrong problem. Local relations do not need to be checkpointed, because
> by definition they were created by a transaction that hasn't committed
> yet. They must be, and are, checkpointed to disk before the transaction
> commits; but up till that time, if you have a crash then the entire
> relation should just go away.

What happens when we have a local file that is created before the backup, and it becomes global during the backup? In order to copy this file, I need either:

1) A copy of all its blocks at the time backup started (or later), plus
   all log records between then and the end of the backup; OR

2) All the log records from the time the local file was created until
   the end of the backup.

In the case of an idle uncommitted transaction that suddenly commits during backup, the log records for case 2 might reach very far back in the log file. In fact, the log file might be archived to tape by then. So I must do case 1, and checkpoint the local relations.

This brings up the question: why do I need to bother backing up files that were local before the backup started, but became global during the backup? We already know that for the backup to be consistent after we restore it, we must play the logs forward to the completion of the backup to repair our fuzzy copies of the database files. Since the transaction that makes the local file into a global one has committed during our backup, its log entries will be played forward as well.

What would happen if a transaction with a local relation commits during backup, and there are log entries inserting the catalog tuples into pg_class? Should I not apply those on restore? How do I know?

> That mechanism is there already --- perhaps it needs a few tweaks for
> PITR but I do not see any need for cross-backend flush commands for
> local relations.

This problem is subtle, and maybe I'm having difficulty explaining it properly. Do you understand the issue I'm raising? Have I made some kind of blunder, so that this is really not a problem?

--
J. R. Nield
[EMAIL PROTECTED]
Re: [HACKERS] PITR, checkpoint, and local relations
On Fri, 2002-08-02 at 13:50, Richard Tucker wrote:
> pg_copy does not handle local relations as you would suspect. To find
> the tables and indexes to back up, the backend processing the ALTER
> SYSTEM BACKUP statement reads the pg_class table. Any tables in the
> process of coming into existence are of course not visible. If somehow
> they were, then the backup would back up their contents. Any
> in-private-memory changes would be captured during crash recovery on
> the copy of the database. So the question is: is it possible to read
> the names of the local relations from the pg_class table even though
> their creation has not yet been committed?
>
> -regards
> richt

No, not really. At least not a consistent view. The way to do this is to use the filesystem to discover the relfilenodes, and there are a couple of ways to deal with the problem of files being pulled out from under you, but you have to be careful about what the buffer manager does when a file gets dropped.

The predicate for files we MUST (fuzzy) copy is:

    File exists at start of backup
    AND file exists at end of backup

Any other file, while it may be copied, doesn't need to be in the backup, because either it will be created and rebuilt during play-forward recovery, or it will be deleted during play-forward recovery, or both, assuming those operations are logged. They really must be logged to do what we want to do.

Also, you can't use the normal relation_open stuff, because local relations will not have a catalog entry, and it looks like there are catcache/sinval issues that I haven't completely covered. So you've got to do 'blind reads' through the buffer manager, which involves a minor extension to the buffer manager to support this if local relations go through the shared buffers, or coordinating with the local buffer manager if they continue to work as they do now, which involves major changes.

We also have to checkpoint at the start, and flush the log at the end.

--
J. R. Nield
[EMAIL PROTECTED]
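A minimal sketch of the "must copy" predicate above, evaluated by pruning the backup set at end of backup: snapshot the data directory at backup start, snapshot again at the end, and keep a file only if it appears in both. All names here are illustrative assumptions, not code from any patch.

    #include <stdbool.h>
    #include <string.h>

    /* Returns true if 'name' appears in a snapshot taken earlier. */
    static bool
    in_snapshot(char **snapshot, int n, const char *name)
    {
        for (int i = 0; i < n; i++)
            if (strcmp(snapshot[i], name) == 0)
                return true;
        return false;
    }

    /* A file must be in the final backup iff it existed at start AND at
     * end; anything copied in between that fails this test can be
     * discarded, since play-forward recovery will recreate or delete it. */
    static bool
    must_copy(char **start_snap, int n_start,
              char **end_snap, int n_end, const char *relfilenode)
    {
        return in_snapshot(start_snap, n_start, relfilenode) &&
               in_snapshot(end_snap, n_end, relfilenode);
    }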
Re: [HACKERS] PITR, checkpoint, and local relations
On Fri, 2002-08-02 at 16:01, Tom Lane wrote:
> J. R. Nield <[EMAIL PROTECTED]> writes:
> > The predicate for files we MUST (fuzzy) copy is:
> >     File exists at start of backup
> >     AND file exists at end of backup
>
> Right, which seems to me to negate all these claims about needing a
> (horribly messy) way to read uncommitted system catalog entries, do
> blind reads, etc. What's wrong with just exec'ing tar after having done
> a checkpoint?

There is no need to read uncommitted system catalog entries. Just take a snapshot of the directory to get the OIDs. You don't care whether they get deleted before you get to them, because the log will take care of that.

> (In particular, I *strongly* object to using the buffer manager at all
> for reading files for backup. That's pretty much guaranteed to blow out
> buffer cache. Use plain OS-level file reads. An OS directory search
> will do fine for finding what you need to read, too.)

How do you get atomic block copies otherwise?

> regards, tom lane

--
J. R. Nield
[EMAIL PROTECTED]
Re: [HACKERS] PITR, checkpoint, and local relations
On Fri, 2002-08-02 at 16:59, Mikheev, Vadim wrote:
> You don't need it, as long as the whole block is saved in the log on
> the first change to the block after a checkpoint (the one you made
> before backup).

I thought half the point of PITR was to be able to turn off pre-image logging, so you can trade potential recovery time for speed without fear of data loss. Didn't we have this discussion before?

How is this any worse than a table scan?

--
J. R. Nield
[EMAIL PROTECTED]
Re: [HACKERS] PITR, checkpoint, and local relations
Are you sure this is true for all ports? And if so, why would it be cheaper for the kernel to do it in its buffer manager, compared to us doing it in ours? This just seems bogus to rely on. Does anyone know what POSIX has to say about this?

On Fri, 2002-08-02 at 18:01, Mikheev, Vadim wrote:
> > > How do you get atomic block copies otherwise?
> >
> > Eh? The kernel does that for you, as long as you're reading the
> > same-size blocks that the backends are writing, no?
>
> Good point.
>
> Vadim

--
J. R. Nield
[EMAIL PROTECTED]
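A minimal sketch of the block-consistent copy under discussion: read in BLCKSZ-sized chunks so that, on platforms where the kernel serializes overlapping same-size block reads and writes, each block is copied atomically. Whether a given port actually guarantees this is exactly the open question raised above; treat that guarantee as an assumption.

    #include <fcntl.h>
    #include <unistd.h>

    #define BLCKSZ 8192

    /* Copy src to dst one BLCKSZ-sized read()/write() at a time. */
    static int
    copy_block_consistent(const char *src, const char *dst)
    {
        char    buf[BLCKSZ];
        ssize_t n = -1;
        int     rc = -1;
        int     in = open(src, O_RDONLY);
        int     out = open(dst, O_WRONLY | O_CREAT | O_EXCL, 0600);

        if (in >= 0 && out >= 0)
        {
            /* One read() per block: the atomicity of each copied block
             * rests on the kernel never returning a torn view of a
             * block-sized read concurrent with a block-sized write. */
            while ((n = read(in, buf, BLCKSZ)) > 0)
                if (write(out, buf, (size_t) n) != n)
                    break;
            if (n == 0)
                rc = 0;         /* clean EOF */
        }
        if (in >= 0)
            close(in);
        if (out >= 0)
            close(out);
        return rc;
    }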
Re: [HACKERS] PITR, checkpoint, and local relations
On Thu, 2002-08-01 at 17:14, Bruce Momjian wrote:
> J.R. needs comments on this. PITR has problems because local relations
> aren't logged to WAL. Suggestions?

I'm sorry if it wasn't clear. The issue is not that local relations aren't logged to WAL; they are. The issue is that you can't checkpoint them. That means if you need a lower bound on the LSN to recover from, then you either need to wait for all transactions using them to commit and flush their local buffers, or there needs to be an asynchronous way to tell them all to flush.

I am working on a way to do this with a signal, using holdoffs around calls into the storage-manager and VFS layers to prevent re-entrant calls. The local buffer manager is simple enough that it should be possible to flush its buffers from within a signal handler at most times, but the VFS and storage manager are not safe to re-enter from a handler.

Does this sound like a good idea?

--
J. R. Nield
[EMAIL PROTECTED]
[HACKERS] PITR, checkpoint, and local relations
As per earlier discussion, I'm working on the hot backup issues as part of the PITR support. While I was looking at the buffer manager and the relcache/MyDb issues to figure out the best way to work this, it occurred to me that PITR will introduce a big problem with the way we handle local relations.

The basic problem is that local relations (rd_myxactonly == true) are not part of a checkpoint, so there is no way to get a lower bound on the starting LSN needed to recover a local relation. In the past this did not matter, because either the local file would be (effectively) discarded during recovery because it had not yet become visible, or the file would be flushed before the transaction creating it made it visible. Now this is a problem.

So I need a decision from the core team on what to do about the local buffer manager. My preference would be to forget about the local buffer manager entirely, or if not that, then to allow it only for _true_ temporary data. The only alternative I can devise is to create some way for all other backends to participate in a checkpoint, perhaps using a signal. I'm not sure this can be done safely.

Anyway, I'm glad the tuplesort stuff doesn't try to use relation files :-)

Can the core team let me know if this is acceptable, and whether I should move ahead with changes to the buffer manager (and some other stuff) needed to avoid special treatment of rd_myxactonly relations?

Also to Richard: have you guys at Multera dealt with this issue already? Is there some way around this that I'm missing?

Regards,
John Nield

Just as an example of this problem, imagine the following sequence:

1) Transaction TX1 creates a local relation LR1 which will eventually become a globally visible table. Tuples are inserted into the local relation, and logged to the WAL file. Some tuples remain in the local buffer cache and are not yet written out, although they are logged. TX1 is still in progress.

2) Backup starts, and checkpoint is called to get a minimum starting LSN (MINLSN) for the backed-up files. Only the global buffers are flushed.

3) The backup process copies LR1 into the backup directory (postulate some way of coordinating with the local buffer manager, a problem I have not solved).

4) TX1 commits and flushes its local buffers. A dirty buffer exists whose LSN is before MINLSN. LR1 becomes globally visible.

5) Backup finishes copying all the files, including the local relations, and then flushes the log. The log files between MINLSN and the current LSN are copied to the backup directory, and backup is done.

6) Sometime later, a system administrator restores the backup and plays the logs forward starting at MINLSN. LR1 will be corrupt, because some of the log entries required for its restoration are from before MINLSN. This corruption will not be detected until something goes wrong.

BTW: The problem doesn't only happen with backup! It occurs at every checkpoint as well; I just missed it until I started working on the hot backup issue.

--
J. R. Nield
[EMAIL PROTECTED]
Re: [HACKERS] Issues Outstanding for Point In Time Recovery (PITR)
Richard: I can't quite follow this; maybe you sent a draft by accident. If you want to post a patch against 7.2.1, or even better against HEAD in CVS, that would be great. Or if you'd rather point me to your source online, that would be good too.

I just want to clarify, though: is this work released to the PostgreSQL Development Group by Progress and Multera, or do they still claim copyright interest in it?

Regards,
J.R. Nield

On Thu, 2002-07-18 at 12:56, Richard Tucker wrote:
> > -----Original Message-----
> > From: J. R. Nield [mailto:[EMAIL PROTECTED]]
> > Sent: Wednesday, July 17, 2002 8:13 PM
> > To: [EMAIL PROTECTED]
> > Cc: Bruce Momjian
> > Subject: RE: [HACKERS] Issues Outstanding for Point In Time Recovery
> > (PITR)
> >
> > On Wed, 2002-07-17 at 19:25, Richard Tucker wrote:
> > > Regarding hot backup: our implementation of pg_copy does a hot
> > > backup. It turns off database checkpointing for the duration of the
> > > backup, and backs up all the files of the database cluster up to
> > > the wal file currently being logged to. It then acquires the
> > > WalInsertLock lock long enough to back up the current wal file.
> >
> > Does it then allow more writes to that WAL file? It would seem like
> > you want to advance the log to the next file, so the sysadmin
> > wouldn't have to choose which one of log-file number 3 he wants to
> > use at restore.
>
> Writes to the wal file are allowed during the backup, except while
> backing up the wal file that is current when the backup completes.
> That is, the pg_xlog directory is the last directory to be backed up,
> and the wal files are backed up in the order they were used. Continued
> wal file logging is allowed until the backup reaches the current wal
> file being written to. To back up this last wal file, the
> WalInsertLock is held until the copy of the wal file is complete. So
> the backup stops update activity only long enough to copy this last
> 16MB file.
>
> > Also, what do you mean by 'turns off checkpointing'? You have to do a
> > checkpoint, or at least flush the buffers, when you start the backup.
> > Otherwise how do you know what LSN to start from at restore?
>
> The pg_control file also gets backed up. It contains the point in the
> log at which to begin the redo/roll-forward. Not allowing the redo
> point to advance while the backup goes on means that the startup
> process's crash recovery code will capture all the changes made to the
> database cluster while the backup was running.
>
> > Anyway: yes, we'd love to see the code.
>
> In what form would you like me to send the code to you, e.g. as a
> patch, a copy of our whole source ...?
>
> > Since I've pretty much got create/drop and index stuff working, if
> > your code does the rest then we should be good to go.
> >
> > ;jrnield
> > --
> > J. R. Nield
> > [EMAIL PROTECTED]

--
J. R. Nield
[EMAIL PROTECTED]
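A minimal sketch of the end-of-backup step Richard describes: hold the WAL insert lock just long enough to copy the current segment, so the backup ends with a complete, self-consistent log. LWLockAcquire()/LWLockRelease() and WALInsertLock are the lock primitives of that era; copy_file() and the two path variables are hypothetical.

    /* Freeze WAL insertion while the current (last) segment is copied.
     * copy_file() and the path variables are hypothetical stand-ins. */
    LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
    copy_file(current_wal_segment_path, backup_wal_dir);  /* one 16MB file */
    LWLockRelease(WALInsertLock);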
Re: [HACKERS] Issues Outstanding for Point In Time Recovery (PITR)
On Wed, 2002-07-17 at 01:25, Bruce Momjian wrote:
> We only patch configure.in. If you post to hackers, they can give you
> assistance and I will try to help however I can. I can do some
> configure.in stuff for you myself.

Thanks for the offer. The only thing I was changing it for was to test whether and how to get an ethernet MAC address using ioctl, so libuuid could use it if available. That is dropped now.

> > Related to that, the other place I need advice is on adding Ted
> > Ts'o's LGPL'd UUID library (stolen from e2fsprogs) to the source. Are
> > we allowed to use this? There is a free OSF/DCE spec for UUIDs, so I
> > can re-implement the library if required.
>
> We talked about this on the replication mailing list. We decided that
> hostname, properly hashed to an integer, was the proper way to get this
> value. Also, there should be a postgresql.conf variable so you can
> override the hostname-generated value if you wish. I think that is
> sufficient.

I will do something like this, but reserve 16 bytes for it just in case we change our minds. It needs to be different among systems on the same machine, so there needs to be a time value and a pseudo-random part as well. Also, 'hostname' will likely be the same on many machines (localhost.localdomain or similar). The only reason I bothered with UUIDs before is because they have a standard setup to make the possibility of collision extremely small, and I figured replication will end up using it someday.

> > We also haven't discussed commands for backup/restore, but I will use
> > what I think is appropriate and we can change the grammar if needed.
> > The initial hot-backup capability will require the database to be in
> > read-only mode and use tar for backup, and I will add the ability to
> > allow writes later.
>
> Yea, I saw Tom balked at that. I think we have enough manpower and time
> that we can get hot backup in normal read/write mode working before 7.3
> beta, so I would just code it assuming the system is live and we can
> deal with making it hot-capable once it is in CVS. It doesn't have to
> work 100% until beta time.

Hot backup in read/write mode requires that we force an advance in the logfile segment after the backup. We need to save all the logs between backup start and completion. Otherwise the files will be useless as a standalone system if the current logs somehow get destroyed (fire in the machine room, etc.).

The way I would do this is:

  1. create a checkpoint
  2. do the block-by-block walk of the files using the bufmgr
  3. create a second checkpoint
  4. force the log to advance past the end of the current segment
  5. save the log segments containing records between the first and
     second checkpoints with the backup

(A sketch of this sequence follows this message.) Then if you restore the backup, you can recover to the point of the second checkpoint, even if the logs since then are all gone.

Right now the log segment size is fixed, so this means that we'd waste 8MB of log space on average to do a backup. Also, the way XLOG reads records right now, we have to write placeholder records into the empty space, because that's how it finds the end of the log stream. So I need to change XLOG to handle skip records, and then to truncate the file when it gets archived, so we don't have to save up to 16MB of zeros.

Also, if archiving is turned off, then we can't recycle or delete any logs for the duration of the backup, and we have to save them.

So I'll finish the XLOG support for this, and then think about the correct way to walk through all the files.

--
J. R. Nield
[EMAIL PROTECTED]
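The promised sketch of the five-step sequence. Every name except CreateCheckPoint() is a hypothetical stand-in; CheckPointAndGetLSN() is an assumed wrapper that runs a checkpoint and reports its LSN, which the real CreateCheckPoint() does not do.

    /* Sketch of the hot-backup sequence above; not an implementation. */
    static void
    hot_backup(const char *backup_dir)
    {
        XLogRecPtr  first_ckpt;
        XLogRecPtr  second_ckpt;

        first_ckpt = CheckPointAndGetLSN();          /* step 1 */

        WalkAllRelationFilesBlockwise(backup_dir);   /* step 2: fuzzy copy,
                                                      * block by block    */

        second_ckpt = CheckPointAndGetLSN();         /* step 3 */

        XLogAdvanceToNextSegment();                  /* step 4: pad the
                                                      * rest of the segment
                                                      * with a skip record */

        SaveLogSegments(first_ckpt, second_ckpt,     /* step 5 */
                        backup_dir);
    }

Restoring such a backup and playing the saved segments forward recovers to the second checkpoint even if all later logs are lost, which is the whole point of forcing the segment boundary.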
Re: [HACKERS] Issues Outstanding for Point In Time Recovery (PITR)
On Tue, 2002-07-16 at 15:36, Bruce Momjian wrote:
> J.R., just checking to see how PITR recovery is going. Do you need any
> assistance or have any questions for us? Also, do you have any idea how
> close you are to having something completed? Are you aware we are
> closing development of 7.3 at the end of August and starting beta
> September 1? Is there any way we can help you?

It should be ready to go into CVS by the end of the month. That will include: logging of all operations except for rtree and GiST; archival of logfiles (with new postgresql.conf params); headers on the logfiles to verify the system that created them; standalone backend recovery to a point in time; and a rudimentary hot backup capability.

I could use some advice on the proper way to add tests to configure.in, given that the autoconf output is in CVS. Would you ever want a patch to include the generated 'configure' file?

Related to that, the other place I need advice is on adding Ted Ts'o's LGPL'd UUID library (stolen from e2fsprogs) to the source. Are we allowed to use this? There is a free OSF/DCE spec for UUIDs, so I can re-implement the library if required.

We also haven't discussed commands for backup/restore, but I will use what I think is appropriate and we can change the grammar if needed. The initial hot-backup capability will require the database to be in read-only mode and use tar for backup, and I will add the ability to allow writes later.

Does this sound like a reasonable timeframe/feature-set to make the 7.3 release?

--
J. R. Nield
[EMAIL PROTECTED]
Re: [HACKERS] pgbench questions
On Sat, 2002-07-13 at 02:25, Tatsuo Ishii wrote:
> > (2) At least in the current CVS version, the code to do a 'CHECKPOINT'
> > after creating a table has been #ifdef'ed out. Why is that?
>
> That is not after creation of a table, but while creating it, which is
> not necessary any more since Tom has fixed the growth of WAL logs.

Tatsuo: Could you or Tom give me some background on what this change was about? Is this something recent, or would it have been in CVS about a month ago? I can't see any reason to force a checkpoint after CREATE TABLE, but it would be interesting to know why it was done before.

;John Nield

--
J. R. Nield
[EMAIL PROTECTED]
Re: [HACKERS] please help on query
On Thu, 2002-07-11 at 11:22, Luis Alberto Amigo Navarro wrote:
> I can't improve performance on this query:

Blame Canada!

--
J. R. Nield
[EMAIL PROTECTED]
[HACKERS] Issues Outstanding for Point In Time Recovery (PITR)
corruption, and we restore it from backup, how do we know what archived files we need for recovery?

Should file corruption (partial write, bad disk block, etc.) outside the system catalog cause us to abort the system, or should we just take the relation or database off-line?

Given a backup file, how do we determine the point in the log where we should start recovery for the file? What is the highest LSN we can use that will fully recover the file?

§5.2 - Proposal:

Put a file header on each file, and update that header to the last checkpoint LSN at least once every 'file_lsn_time_slack' minutes, or at least once every dbsize/'file_lsn_log_slack' megabytes of log written, where dbsize is the estimated size of the database. Have these values be settable from the config file. These updates would be distributed throughout the hour, or interspersed between regular amounts of log generation.

If we have a database backup program or command, it can update the header on the file before backup to the greatest value it can assure to be safe.

§5.3 - Status:

Waiting for input from pgsql-hackers.

Questions: There are alternate methods than using a file header to get a known-good LSN lower bound for the starting point to recover a backup file. Is this the best way?

A) The Definitions

This stuff is obtuse, but I need it here to keep track of what I'm saying. Someday I should use it consistently in the rest of this document.

system or database system:
    A collection of postgres databases in one $PGDATA directory, managed
    by one postmaster instance at a time (and having one WAL log, etc.).
    All the files composing such a system, as a group.

up to date or now or current or current LSN:
    The most recent durable LSN for the system.

block consistent copy:
    When referring to a file: a copy of a file, which may be written to
    during the process of copying, but where each BLCKSZ-size block is
    copied atomically.
    When referring to multiple files (in the same system): a copy of all
    the files, such that each is independently a block consistent copy.

file consistent copy:
    When referring to a file: a copy of a file that is not written to
    between the start and end of the copy operation.
    When referring to multiple files (in the same system): a copy of all
    the files, such that each is independently a file consistent copy.

system consistent copy:
    When referring to a file: a copy of a file, where the entire system
    of which it is a member is not written to during the copy.
    When referring to multiple files (in the same system): a copy of all
    the files, where the entire system of which they are members was not
    written to between the start and end of the copying of all the
    files, as a group.

shutdown consistent copy:
    When referring to a file: a copy of a file, where the entire system
    of which it is a member had been cleanly shut down before the start
    of, and for the duration of, the copy.
    When referring to multiple files (in the same system): a copy of all
    the files, where the entire system of which they are members had
    been cleanly shut down before the start of, and for the duration of,
    the copying of all the files, as a group.

consistent copy:
    A block, file, system, or shutdown consistent copy.

known-good LSN lower bound or LSN lower bound or LSN-LB:
    When referring to a group of blocks, a file, or a group of files: an
    LSN known to be old enough that no log entries before it are needed
    to bring the blocks or files up to date.

known-good LSN greatest lower bound or LSN greatest lower bound or LSN-GLB:
    When referring to a group of blocks, a file, or a group of files:
    the greatest possible LSN that is a known-good LSN lower bound for
    the group.

backup file:
    A consistent copy of a data file used by the system, for which we
    have a known-good LSN lower bound.

optimal backup file:
    A backup file for which we have the known-good LSN greatest lower
    bound.

backup system:

Play-Forward File Recovery or PFFR:
    The process of bringing an individual backup file up to date.

--
J. R. Nield
[EMAIL PROTECTED]
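A minimal sketch of the per-file header proposed in §5.2 above: a place to record a known-good LSN lower bound that is refreshed periodically. The struct, field, and constant names are assumptions for illustration; XLogRecPtr is the existing WAL pointer type.

    /* Hypothetical header stored at the front of every relation file. */
    typedef struct FileHeader
    {
        uint32      fh_magic;     /* identifies a PostgreSQL data file  */
        uint32      fh_version;   /* header layout version              */
        XLogRecPtr  fh_lsn_lb;    /* known-good LSN lower bound: no log
                                   * record before this is needed to
                                   * bring the file up to date          */
    } FileHeader;

    /* Refresh policy from the proposal: advance fh_lsn_lb to the last
     * checkpoint LSN at least once every file_lsn_time_slack minutes,
     * or once every dbsize/file_lsn_log_slack megabytes of log written,
     * spreading the updates out over time. */

A backup tool would then read fh_lsn_lb from each backup file to learn the earliest archived log segment it must keep for that file's recovery.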
Re: [HACKERS] Issues Outstanding for Point In Time Recovery (PITR)
On Thu, 2002-07-04 at 11:45, J. R. Nield wrote:

One other item that should be here:

> The big items so far are:
>
> §1 - Logging relation file creation, truncation, and removal
>      This is mostly done. Can do infinite play-forward from online
>      logs.
> §2 - Partial-write and bad block detection
>      Need input before starting. Migration issues.
> §3 - Detecting shutdown consistent system recovery
>      Mostly done.
> §4 - Interactive play-forward recovery for an entire system
>      Need input before starting.
> §5 - Individual file consistent recovery
>      Need input. Semi-major changes required.

§6 - btbuild is not logged
     Not logged because of the same assumptions as for file create. Only
     need to log the index build parameters to recreate the index, not
     each page change.

--
J. R. Nield
[EMAIL PROTECTED]
Re: [HACKERS] Vacuum Daemon
On Sat, 2002-06-29 at 20:14, Tom Lane wrote:
> Matthew T. O'Connor <[EMAIL PROTECTED]> writes:
> > Second: There was some discussion
> > (http://archives.postgresql.org/pgsql-hackers/2002-05/msg00970.php)
> > about this not being needed once UNDO is in place; what is the
> > current view on this?
>
> I do not think that is the case; and anyway we've pretty much rejected
> Vadim's notion of going to an Oracle-style UNDO buffer. I don't foresee
> VACUUM going away anytime soon --- what we need is to make it less
> obtrusive. 7.2 made some progress in that direction, but we need more.

Could someone point me to this discussion, or summarize what the problem was? Was his proposal to keep tuple versions in the UNDO AM, or only pointers to them? The referred-to message seems to be about something else.

;jrnield

--
J. R. Nield
[EMAIL PROTECTED]
Re: [HACKERS] Vacuum Daemon
On Sat, 2002-06-29 at 21:55, Tom Lane wrote:
> J. R. Nield <[EMAIL PROTECTED]> writes:
> > > I do not think that is the case; and anyway we've pretty much
> > > rejected Vadim's notion of going to an Oracle-style UNDO buffer.
> >
> > Could someone point me to this discussion, or summarize what the
> > problem was?
>
> I'm too lazy to dig through the archives at the moment, but the main
> points were (a) a finite-size UNDO buffer chokes on large transactions

Yes, this is a good point. Oracle was always lame with its ROLLBACK SEGMENTS. SolidDB (SolidWorks? It's been a while...) used a btree-like structure for this that was not of fixed size. Oracle supposedly moved to the same method in its 9i release, but I don't know the details.

I could never figure out how they did this, until I realized that UNDO doesn't need to be in the WAL log. You just use any access method you feel like, and make sure the method is itself protected by REDO. Just insert REDO entries to protect the insert into the UNDO AM, and REDO-log when you delete. That makes it easy to have the recovery code be idempotent, to catch the case of a system crash during recovery.

> and (b) the Oracle approach requires live transaction processing to do
> the cleanup work that our approach can push off to hopefully-not-
> time-critical vacuum processing.

I'm not sure which way I'm leaning on this. On the one hand, it requires extra work to clean up while the system is live, in addition to writing the undo records, though the cleanup is not necessarily done by the same transaction that committed the work (the cleanup needs to be deferred until it's out of an active snapshot anyway). On the other hand, you can clean up without a full table scan, because you know which tuples need to be changed. This can be a big advantage on gigantic tables. Also, it lets you remove deleted tuples quickly, so the space can be reused, and it eliminates the xid wraparound problem.

Of course, any kind of undo is worse for performance with bulk inserts/updates, so you either end up committing every few thousand inserts, or you use some special extension to disable undo logging for a bulk load (or, if you really want to be able to roll it back, you live with it :-)

How slow is it to vacuum a 1 TB database with postgres? Do we even have any users who could test this?

Also, I would never advocate that we do what I'm pretty sure Oracle does, and keep old values in the Rollback Segment. Only (RelFileNode, ItemPointerData) addresses would need to be kept in the UNDO AM, if we went this route.

> UNDO per se doesn't eliminate VACUUM anyhow; it only reclaims space
> from tuples written by aborted transactions. If you want to get rid of
> VACUUM then you need another way to get rid of the old versions of
> perfectly good committed tuples that are obsoleted by updates from
> later transactions. That essentially means you need an overwriting
> storage manager, which is a concept that doesn't mix well with MVCC.

Well, you can keep the UNDO records after commit to do a fast incremental vacuum as soon as the transaction that deleted the tuples becomes older than the oldest snapshot. If this is always done whenever an XID becomes that old, then you never need to vacuum, and you never need a full table scan. Because postgres never overwrites (except at vacuum), I think it actually makes us a BETTER candidate for this to be implemented cleanly than an overwriting storage manager would be. We would never need to keep tuple values in UNDO!

> Oracle found a solution to that conundrum, but it's really not obvious
> to me that their solution is better than ours.

Their approach was worse, because they had an overwriting storage manager before they tried to implement it (I'm guessing). :-)

> Also, they have patents that we'd probably run afoul of if we try to
> imitate their approach too closely.

Given the current state of affairs here in the US, PostgreSQL probably violates hundreds or even thousands of software patents. It probably violates tens of patents that have been upheld in court. The only thing keeping companies from shutting down postgres, linux, OpenOffice, and a hundred other projects is fear of adverse publicity, and the fact that development would move overseas and continue to be a thorn in their side. We'll see how long this lasts, given the fear some vendors have of certain maturing open-source/GPL projects, but I don't think PostgreSQL will be first, since anyone can take this code and become an instant proprietary database vendor!

(No, I'm not complaining. Please, nobody start a license fight because of this.)

--
J. R. Nield
[EMAIL PROTECTED]
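A minimal sketch of the address-only UNDO entry argued for above: just (RelFileNode, ItemPointerData) plus the acting transaction id, with no tuple values. RelFileNode, ItemPointerData, and TransactionId are existing PostgreSQL types; the struct name and layout are illustrative assumptions.

    /* Hypothetical entry in the UNDO access method: a tuple address,
     * never a tuple value. */
    typedef struct UndoTupleRef
    {
        TransactionId   xid;    /* transaction that wrote the version */
        RelFileNode     rnode;  /* which relation file                */
        ItemPointerData tid;    /* which tuple within it              */
    } UndoTupleRef;

    /* On abort: visit each UndoTupleRef of xid and mark the tuple dead.
     * After commit, once xid predates the oldest snapshot: reuse the
     * same refs to reclaim superseded versions incrementally, with no
     * full-table vacuum scan. */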
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
On Sun, 2002-06-23 at 21:29, J. R. Nield wrote:
> If is impossible to do what you want. You can not protect against...

Wow. The number of typos in that last one was just amazing. I even started with one. Have a nice weekend everybody :-)

;jrnield

--
J. R. Nield
[EMAIL PROTECTED]
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
On Sun, 2002-06-23 at 15:36, Bruce Momjian wrote:
> Yes, I don't see writing to two files vs. one to be any win, especially
> when we need to fsync both of them. What I would really like is to
> avoid the double I/O of writing to WAL and to the data file; improving
> that would be a huge win.

It is impossible to do what you want. You cannot protect against partial writes without writing pages twice and calling fdatasync between them while going through a generic filesystem. The best disk array will not protect you if the operating system does not align block writes to the structure of the underlying device. Even with raw devices, you need special support or knowledge of the operating system and/or the disk device to ensure that each write request will be atomic to the underlying hardware.

All other systems rely on the fact that you can recover a damaged file using the log archive. This means downtime in the rare case, but no data loss. Until PostgreSQL can do this, it will not be acceptable for real critical production use.

This is not to knock PostgreSQL, because it is a very good database system, and clearly the best open-source one. It even has feature advantages over the commercial systems. But at the end of the day, unless you have complete understanding of the I/O system from write(2) through to the disk system, the only sure ways to protect against partial writes are by careful writes (in the WAL log or elsewhere, writing pages twice), or by requiring (and allowing) users to do log-replay recovery when a file is corrupted by a partial write. As long as there is a UPS, and the operating system doesn't crash, then there still should be no partial writes.

If we log pages to WAL, they are useless when archived (after a checkpoint). So either we have a separate log for them (the ping-pong file), or we should at least remove them when archived, which makes log archiving more complex but is perfectly doable.

Finally, I would love to hear why we are using the operating system buffer manager at all. The OS is acting as a secondary buffer manager for us. Why is that? What flaw in our I/O system does this reveal?

I know that we sync only WAL, not the other pages, except for the sync() call we do during checkpoint when we discard old WAL files. But this is probably not a good thing. We should only be writing blocks when they need to be on disk. We should not be expecting the OS to write them sometime later and avoid blocking (as long) for the write. If we need that, then our buffer management is wrong and we need to fix it. The reason we are doing this is because we expect the OS buffer manager to do asynchronous I/O for us, but then we don't control the order. That is the reason why we have to call fdatasync(): to create sequence points.

The reason we have performance problems with either O_DSYNC or fdatasync on the normal relations is because we have no dbflush process. This causes an unacceptable amount of I/O blocking by other transactions.

The ORACLE people were not kidding when they said that they could not certify Linux for production use until it supported O_DSYNC. Can you explain why that was the case?

Finally, let me apologize if the above comes across as somewhat belligerent. I know very well that I can't compete with you guys for knowledge of the PostgreSQL system. I am still at a loss when I look at the optimizer and executor modules, and it will take some time before I can follow discussion of that area. Even then, I doubt my ability to compare with people like Mr. Lane and Mr. Momjian in experience and general intelligence, or in the field of database programming and software development in particular.

However, this discussion and a search of the pgsql-hackers archives reveal this problem to be the KEY area of PostgreSQL's failing, and general misunderstanding, when compared to its commercial competitors.

Sincerely,
J. R. Nield

--
J. R. Nield
[EMAIL PROTECTED]
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
On Sun, 2002-06-23 at 23:40, Curt Sampson wrote:
> On 23 Jun 2002, J. R. Nield wrote:
> > If is impossible to do what you want. You can not protect against
> > partial writes without writing pages twice and calling fdatasync
> > between them while going through a generic filesystem.
>
> I agree with this.
>
> > The best disk array will not protect you if the operating system does
> > not align block writes to the structure of the underlying device.
>
> This I don't quite understand. Assuming you're using a SCSI drive (and
> this mostly applies to ATAPI/IDE, too), you can do naught but align
> block writes to the structure of the underlying device. When you
> initiate a SCSI WRITE command, you start by telling the device at which
> block to start writing and how many blocks you intend to write. Then
> you start passing the data.

All I'm saying is that the entire postgresql block write must be converted into exactly one SCSI write command in all cases, and I don't know a portable way to ensure this.

> > Even with raw devices, you need special support or knowledge of the
> > operating system and/or the disk device to ensure that each write
> > request will be atomic to the underlying hardware.
>
> Well, so here I guess you're talking about two things:
>
> 1. When you request, say, an 8K block write, will the OS really write
> it to disk in a single 8K or multiple-of-8K SCSI write command?
>
> 2. Does the SCSI device you're writing to consider these writes to be
> transactional? That is, if the write is interrupted before being
> completed, does the SCSI device guarantee that the partially-sent data
> is not written, and the old data is maintained? And of course, does it
> guarantee that, when it acknowledges a write, that write is now in
> stable storage and will never go away?
>
> Both of these are not hard to guarantee, actually. For a BSD-based OS,
> for example, just make sure that your filesystem block size is the same
> as or a multiple of the database block size. BSD will never write
> anything other than a block or a sequence of blocks to a disk in a
> single SCSI transaction (unless you've got a really odd SCSI driver).
> And for your disk, buy a Baydel or Clarion disk array, or something
> similar.
>
> Given that it's not hard to set up a system that meets these criteria,
> and this is in fact commonly done for database servers, it would seem a
> good idea for postgres to have the option to take advantage of the time
> and money spent and adjust its performance upward appropriately.

I agree with this. My point was only that you need to know what guarantees your operating system/hardware combination provides on a case-by-case basis, and there is no standard way for a program to discover this. Most system administrators are not going to know this either, unless databases are their main responsibility.

> > All other systems rely on the fact that you can recover a damaged
> > file using the log archive.
>
> Not exactly. For MS SQL Server, at any rate, if it detects a page tear
> you cannot restore based on the log file alone. You need a full or
> partial backup that includes that entire torn block.

I should have been more specific: you need a backup of the file from some time ago, plus all the archived logs from then until the current log sequence number.

> > This means downtime in the rare case, but no data loss. Until
> > PostgreSQL can do this, then it will not be acceptable for real
> > critical production use.
>
> It seems to me that it is doing this right now. In fact, it's more
> reliable than some commercial systems (such as SQL Server) because it
> can recover from a torn block with just the logfile.

Again, what I meant to say is that the commercial systems can recover with an old file backup plus logs. How old the backup can be depends only on how much time you are willing to spend playing the logs forward. So if you do a full backup once a week, and multiplex and back up the logs, then even if a backup tape gets destroyed you can still survive. It just takes longer.

Also, PostgreSQL can't recover from any other type of block corruption, while the commercial systems can. That's what I meant by the 'critical production use' comment, which was sort-of unfair.

So I would say they are equally reliable for torn pages (but not bad blocks), and the commercial systems let you trade potential recovery time for not having to write the blocks twice. You do need to back up the log archives, though.

> > But at the end of the day, unless you have complete understanding of
> > the I/O system from write(2) through to the disk system, the only
> > sure ways to protect against partial writes are by careful writes (in
> > the WAL log or elsewhere, writing pages twice), or by requiring (and
> > allowing) users to do log-replay recovery when a file is corrupted by
> > a partial write.
>
> I don't understand how, without a copy of the old data that was in the
> torn block, you can restore that block from just log file entries. Can
> you
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
On Mon, 2002-06-24 at 17:16, Tom Lane wrote:
> I think you have been missing the point...

Yes, this appears to be the case. Thanks especially to Curt for clearing things up for me.

--
J. R. Nield
[EMAIL PROTECTED]
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
On Sat, 2002-06-22 at 19:17, Bruce Momjian wrote:
> J. R. Nield wrote:
> > One other point: Page pre-image logging is fundamentally the same as
> > what Jim Gray's book [1] would call careful writes. I don't believe
> > they should be in the XLOG, because we never need to keep the
> > pre-images after we're sure the buffer has made it to the disk.
> > Instead, we should have the buffer IO routines implement ping-pong
> > writes of some kind if we want protection from partial writes.
>
> Ping-pong writes to where? We have to fsync, and rather than fsync that
> area and WAL, we just do WAL. Not sure about a win there.

The key question is: do we have some method to ensure that the OS doesn't do the writes in parallel? If the OS will ensure that one of the two block writes of a ping-pong completes before the other starts, then we don't need to fsync() at all.

The only thing we are protecting against is the possibility of both writes being partial. If neither is done, that's fine, because WAL will protect us. If the first write is partial, we will detect that and use the old data from the other, then recover from WAL. If the first is complete but the second is partial, then we detect that and use the newer block from the first write. If the second is complete but the first is partial, we detect that and use the newer block from the second write.

So does anyone know a way to prevent parallel writes in one of the common Unix standards? Do they say anything about this? It would seem to me that if the same process does both ping-pong writes, then there should be a cheap way to enforce a serial order. I could be wrong, though.

As to where the first block of the ping-pong should go, maybe we could reserve a file with nBlocks of space for them, and write the information about which block was being written to the XLOG for use in recovery. There are many other ways to do it.

;jrnield

--
J. R. Nield
[EMAIL PROTECTED]
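A minimal sketch of a ping-pong (careful) write as described above. Absent an OS guarantee of serial ordering, the fdatasync() between the two writes supplies it portably, and it is exactly what ensures at most one of the two copies can be torn. The file layout and function names are assumptions.

    #include <unistd.h>

    /* Write the block to a reserved scratch slot, force it to disk,
     * then write it to its home location. */
    static int
    careful_write(int scratch_fd, int data_fd, off_t blockoff,
                  const char *buf, size_t blcksz)
    {
        /* First copy: the scratch (ping) slot. */
        if (pwrite(scratch_fd, buf, blcksz, blockoff) != (ssize_t) blcksz)
            return -1;
        if (fdatasync(scratch_fd) != 0)   /* sequence point: copy 1 is
                                           * durable before copy 2 starts */
            return -1;

        /* Second copy: the block's home (pong) location. */
        if (pwrite(data_fd, buf, blcksz, blockoff) != (ssize_t) blcksz)
            return -1;
        return 0;
    }

Recovery then follows the case analysis in the message: if exactly one copy is torn (detectable by a per-block checksum or LSN), use the other; if neither was completed, WAL replay repairs the page.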
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
On Sun, 2002-06-23 at 11:19, Tom Lane wrote:
> Curt Sampson <[EMAIL PROTECTED]> writes:
> > This should also allow us to disable completely the ping-pong writes
> > if we have a disk subsystem that we trust.
>
> If we have a disk subsystem we trust, we just disable fsync on the WAL
> and the performance issue largely goes away.

It wouldn't work, because the OS buffering interferes, and we need those WAL records on disk up to the greatest LSN of the buffer we will be writing.

We already buffer WAL ourselves. We also already buffer regular pages. Whenever we write a buffer out of the buffer cache, it is because we really want that page on disk and wanted to start an IO. If that's not the case, then we should have more block buffers!

So since we have all this buffering designed especially to meet our needs, and since the OS buffering is in the way, can someone explain to me why postgresql would ever open a file without the O_DSYNC flag if the platform supports it?

> I concur with Bruce: the reason we keep page images in WAL is to
> minimize the number of places we have to fsync, and thus the amount of
> head movement required for a commit. Putting the page images elsewhere
> cannot be a win AFAICS.

Why not put all the page images in a single pre-allocated file and treat it as a ring? How could this be any worse than flushing them in the WAL log? Maybe fsync would be slower with two files, but I don't see how fdatasync would be, and most platforms support that.

What would improve performance would be to have a dbflush process that would work in the background flushing buffers in groups and trying to stay ahead of ReadBuffer requests. That would let you do the temporary side of the ping-pong as a huge O_DSYNC writev(2) request (or fdatasync() once) and then write out the other buffers. It would also tend to prevent the other backends from blocking on write requests.

A dbflush could also support aio_read/aio_write on platforms like Solaris and WindowsNT that support it.

Am I correct that right now, buffers only get written when they get removed from the free list for reuse? So a released dirty buffer will sit in the buffer free list until it becomes the least recently used buffer, and will then cause a backend to block for IO in a call to BufferAlloc?

This would explain why we like using the OS buffer cache, and why our performance is troublesome when we have to do synchronous IO writes, and why fsync() takes so long to complete. All of the backends block for each call to BufferAlloc() after a large table update by a single backend, and then the OS buffers are always full of our written data.

Am I reading the bufmgr code correctly? I already found an imaginary race condition there once :-)

;jnield

> Well, whether or not there's a cheap way depends on whether you
> consider fsync to be cheap. :-)

It's never cheap :-(

--
J. R. Nield
[EMAIL PROTECTED]
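A minimal sketch of the O_DSYNC idea raised above: open data files with synchronized data-integrity writes where the platform provides the flag, falling back to plain open() plus explicit fdatasync() sequence points elsewhere. The portability of O_DSYNC is exactly the catch being debated, so treat the #ifdef as the load-bearing assumption.

    #include <fcntl.h>

    static int
    open_datafile(const char *path)
    {
    #ifdef O_DSYNC
        /* Each write() returns only after the data reaches stable
         * storage, so no separate fdatasync() calls are needed. */
        return open(path, O_RDWR | O_DSYNC);
    #else
        /* Caller must issue fdatasync() explicitly to order writes. */
        return open(path, O_RDWR);
    #endif
    }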
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
On Sun, 2002-06-23 at 12:10, Curt Sampson wrote:
> So what we would really need to implement, if we wanted to be more
> efficient with trusted disk subsystems, would be the option of writing
> to the log only the changed row or changed part of the row, or writing
> the entire changed page. I don't know how hard this would be ...

We already log that stuff. The page images are in addition to the logical changes, so we could just stop logging the page images.

--
J. R. Nield
[EMAIL PROTECTED]
Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
On Thu, 2002-06-20 at 21:58, Bruce Momjian wrote:
> I was wondering, how does knowing the block is corrupt help MS SQL?
> Right now, we write changed pages to WAL, then later write them to
> disk. I have always been looking for a way to prevent these WAL writes.
> The 512-byte bit seems interesting, but how does it help? And how does
> the bit help them with partial block writes? Is the bit at the end of
> the block? Is that reliable?

My understanding of this is as follows:

1) On most commercial systems, if you get a corrupted block (from a partial write or whatever), you need to restore the file(s) from the most recent backup, and replay the log from the log archive (usually only the damaged files will be written to during replay).

2) If you can't deal with the downtime to recover the file, then EMC, Sun, or IBM will sell you an expensive disk array with an NVRAM cache that will do atomic writes. Some plain-vanilla SCSI disks are also capable of atomic writes, though usually they don't use NVRAM to do it.

The database must then make sure that each page write gets translated into exactly one SCSI-level write. This is one reason why ORACLE and Sybase recommend that you use raw disk partitions for high availability. Some operating systems support this through the filesystem, but it is OS dependent. I think Solaris 7 and 8 have support for this, but I'm not sure.

PostgreSQL has trouble because it can neither archive logs for replay, nor use raw disk partitions.

One other point: Page pre-image logging is fundamentally the same as what Jim Gray's book [1] would call careful writes. I don't believe they should be in the XLOG, because we never need to keep the pre-images after we're sure the buffer has made it to the disk. Instead, we should have the buffer IO routines implement ping-pong writes of some kind if we want protection from partial writes.

Does any of this make sense?

;jrnield

[1] Gray, J. and Reuter, A. (1993). Transaction Processing: Concepts and Techniques. Morgan Kaufmann.

--
J. R. Nield
[EMAIL PROTECTED]
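On Bruce's question about the 512-byte bit: a common torn-page detection scheme stamps each 512-byte sector of a page with a token that changes on every write, so a write that only partially reaches disk leaves mismatched sectors. A minimal sketch follows, with an illustrative layout that is an assumption, not MS SQL Server's actual on-disk format.

    #include <stdbool.h>
    #include <stdint.h>

    #define PAGESZ   8192
    #define SECTORSZ 512
    #define NSECTORS (PAGESZ / SECTORSZ)    /* 16 */

    typedef struct TornPageHdr
    {
        uint8_t token;              /* incremented on every write       */
        uint8_t saved[NSECTORS];    /* displaced last byte per sector   */
    } TornPageHdr;

    /* Before writing: save the last byte of each sector into the header
     * and overwrite it with the new token. */
    static void
    stamp_page(char *page, TornPageHdr *hdr)
    {
        hdr->token++;
        for (int i = 0; i < NSECTORS; i++)
        {
            char *p = page + (i + 1) * SECTORSZ - 1;
            hdr->saved[i] = (uint8_t) *p;   /* save displaced byte */
            *p = (char) hdr->token;         /* stamp the sector    */
        }
    }

    /* After reading: any sector whose stamp disagrees with the header's
     * token missed the last write -- the page is torn. */
    static bool
    page_is_torn(const char *page, const TornPageHdr *hdr)
    {
        for (int i = 0; i < NSECTORS; i++)
            if ((uint8_t) page[(i + 1) * SECTORSZ - 1] != hdr->token)
                return true;
        return false;
    }

    /* On a clean read, the displaced bytes are put back before use. */
    static void
    unstamp_page(char *page, const TornPageHdr *hdr)
    {
        for (int i = 0; i < NSECTORS; i++)
            page[(i + 1) * SECTORSZ - 1] = (char) hdr->saved[i];
    }

Note this only detects the tear; as discussed in the thread, repairing the page still requires an old copy of the block from a backup plus log replay.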