Re: [HACKERS] Issues Outstanding for Point In Time Recovery (PITR)
On Thu, 2002-07-18 at 15:36, Richard Tucker wrote: > Sorry, I didn't delimit my comments correctly before. > I'm offering our pg_copy/roll forward implementation to PostgreSQL.org if it > finds it an acceptable contribution. Progress/Multera would hand over all > rights to any code accepted. > > I'd be willing to post a patch to HEAD -- where can I find instructions on > how to do this? > Instructions for how to get the latest development source are at: http://developer.postgresql.org/TODO/docs/cvs.html If you want to post a context diff against 7.2.1 to pgsql-hackers first, that would let us see what it does. Let me know if there is anything else I can help you with. -- J. R. Nield [EMAIL PROTECTED] ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Issues Outstanding for Point In Time Recovery (PITR)
Progress/Multera would release the hot backup/roll forward work to the PostgreSQL Development group. -regards richt -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On Behalf Of J. R. Nield Sent: Thursday, July 18, 2002 2:34 PM To: [EMAIL PROTECTED] Cc: Bruce Momjian; PostgreSQL Hacker Subject: Re: [HACKERS] Issues Outstanding for Point In Time Recovery (PITR) Richard: I can't quite follow this; maybe you sent a draft by accident. If you want to post a patch against 7.2.1, or even better against HEAD in CVS, that would be great. Or if you'd rather point me to your source online, that would be good too. I just want to clarify though: is this work released to the PostgreSQL Development group by Progress and Multera, or do they still claim copyright interest in it? Regards, J.R. Nield On Thu, 2002-07-18 at 12:56, Richard Tucker wrote: > > > -Original Message- > From: J. R. Nield [mailto:[EMAIL PROTECTED]] > Sent: Wednesday, July 17, 2002 8:13 PM > To: [EMAIL PROTECTED] > Cc: Bruce Momjian > Subject: RE: [HACKERS] Issues Outstanding for Point In Time Recovery > (PITR) > > > On Wed, 2002-07-17 at 19:25, Richard Tucker wrote: > > Regarding hot backup. Our implementation of "pg_copy" does a hot backup. > > It turns off database checkpointing for the duration of the backup. > Backups > > all the files of the database cluster up to the wal file currently being > > logged to. It then acquires the WalInsertLock lock long enough to backup > > the current wal file. > > Does it then allow more writes to that WAL file? It would seem like you > want to advance the log to the next file, so the sysadmin wouldn't have > to choose which one of log-file number 3 he wants to use at restore. > > > Writes to the wal file are allowed during the backup except for the > backing of the wal file current when the > > backup completes. That is the pg_xlog directory is the last directory to > be backed up. The wal_files are backed > > up in the order they were used. Continued wal file logging is allowed > until the backup reaches the current wal > > file being written to. To back up this last wal file the WalInsertLock is > held until the copy of the wal file > > is complete. So the backup stops update activity only long enough to copy > this last 16mb file. > > Also, what do you mean by 'turns off checkpointing'. You have to do a > checkpoint, or at least flush the buffers, when you start the backup. > Otherwise how do you know what LSN to start from at restore? > > > The pg_control file also gets backed up. It contains the point in the log > at which to begin > > the redo/roll forward. > > By not allowing the redo point to advance while the backup goes on means > that the startup processes' crash > > recovery code will capture all the changes made to the database cluster > while the backup was running. > > > Anyway: Yes we'd love to see the code. > > > In what form would you like me to send the code to you e.g. as a patch, > copy our whole source ... > > Since I've pretty-much got create/drop and index stuff working, if your > code does the rest then we should be good to go. > > ;jrnield > > > -- > J. R. Nield > [EMAIL PROTECTED] > > > > -- J. R. Nield [EMAIL PROTECTED] ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send "unregister YourEmailAddressHere" to [EMAIL PROTECTED]) ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Issues Outstanding for Point In Time Recovery (PITR)
Richard: I can't quite follow this; maybe you sent a draft by accident. If you want to post a patch against 7.2.1, or even better against HEAD in CVS, that would be great. Or if you'd rather point me to your source online, that would be good too. I just want to clarify though: is this work released to the PostgreSQL Development group by Progress and Multera, or do they still claim copyright interest in it? Regards, J.R. Nield On Thu, 2002-07-18 at 12:56, Richard Tucker wrote: > > > -Original Message- > From: J. R. Nield [mailto:[EMAIL PROTECTED]] > Sent: Wednesday, July 17, 2002 8:13 PM > To: [EMAIL PROTECTED] > Cc: Bruce Momjian > Subject: RE: [HACKERS] Issues Outstanding for Point In Time Recovery > (PITR) > > > On Wed, 2002-07-17 at 19:25, Richard Tucker wrote: > > Regarding hot backup. Our implementation of "pg_copy" does a hot backup. > > It turns off database checkpointing for the duration of the backup. > Backups > > all the files of the database cluster up to the wal file currently being > > logged to. It then acquires the WalInsertLock lock long enough to backup > > the current wal file. > > Does it then allow more writes to that WAL file? It would seem like you > want to advance the log to the next file, so the sysadmin wouldn't have > to choose which one of log-file number 3 he wants to use at restore. > > > Writes to the wal file are allowed during the backup except for the > backing of the wal file current when the > > backup completes. That is the pg_xlog directory is the last directory to > be backed up. The wal_files are backed > > up in the order they were used. Continued wal file logging is allowed > until the backup reaches the current wal > > file being written to. To back up this last wal file the WalInsertLock is > held until the copy of the wal file > > is complete. So the backup stops update activity only long enough to copy > this last 16mb file. > > Also, what do you mean by 'turns off checkpointing'. You have to do a > checkpoint, or at least flush the buffers, when you start the backup. > Otherwise how do you know what LSN to start from at restore? > > > The pg_control file also gets backed up. It contains the point in the log > at which to begin > > the redo/roll forward. > > By not allowing the redo point to advance while the backup goes on means > that the startup processes' crash > > recovery code will capture all the changes made to the database cluster > while the backup was running. > > > Anyway: Yes we'd love to see the code. > > > In what form would you like me to send the code to you e.g. as a patch, > copy our whole source ... > > Since I've pretty-much got create/drop and index stuff working, if your > code does the rest then we should be good to go. > > ;jrnield > > > -- > J. R. Nield > [EMAIL PROTECTED] > > > > -- J. R. Nield [EMAIL PROTECTED] ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])
Re: [HACKERS] Issues Outstanding for Point In Time Recovery (PITR)
J. R. Nield wrote: > I will do something like this, but reserve 16 bytes for it just in case > we change our minds. It needs to be different among systems on the same > machine, so there needs to be a time value and a pseudo-random part as > well. Also, 'hostname' will likely be the same on many machines > (localhost.localdomain or similar). > > The only reason I bothered with UUID's before is because they have a > standard setup to make the possibility of collision extremely small, and > I figured replication will end up using it someday. Sure. Problem is, we support so many platforms that any trickery is a problem. If they can change it in postgresql.conf, that should be sufficient. > Hot backup read/write requires that we force an advance in the logfile > segment after the backup. We need to save all the logs between backup > start and completion. Otherwise the files will be useless as a > standalone system if the current logs somehow get destroyed (fire in the > machine room, etc.). > > The way I would do this is: > > create a checkpoint > do the block-by-block walk of the files using the bufmgr > create a second checkpoint > force the log to advance past the end of the current segment > save the log segments containing records between the >first & second checkpont with the backup Sounds good. > Then if you restore the backup, you can recover to the point of the > second checkpoint, even if the logs since then are all gone. Good, you put the logs that happened during the backup inside the same backup, make it consistent. Makes sense. > Right now the log segment size is fixed, so this means that we'd waste > 8MB of log space on average to do a backup. Also, the way XLOG reads > records right now, we have to write placeholder records into the empty > space, because that's how it finds the end of the log stream. So I need > to change XLOG to handle "skip records", and then to truncate the file > when it gets archived, so we don't have to save up to 16MB of zeros. > > Also, if archiving is turned off, then we can't recycle or delete any > logs for the duration of the backup, and we have to save them. > > So I'll finish the XLOG support for this, and then think about the > correct way to walk through all the files. Sounds like a good plan. -- Bruce Momjian| http://candle.pha.pa.us [EMAIL PROTECTED] | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup.| Drexel Hill, Pennsylvania 19026 ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/users-lounge/docs/faq.html
Re: [HACKERS] Issues Outstanding for Point In Time Recovery (PITR)
On Wed, 2002-07-17 at 01:25, Bruce Momjian wrote: > > We only patch configure.in. If you post to hackers, they can give you > assistance and I will try to help however I can. I can so some > configure.in stuff for you myself. Thanks for the offer. The only thing I was changing it for was to test whether and how to get a ethernet MAC address using ioctl, so libuuid could use it if available. That is dropped now. > > > Related to that, the other place I need advice is on adding Ted Tso's > > LGPL'd UUID library (stolen from e2fsprogs) to the source. Are we > > allowed to use this? There is a free OSF/DCE spec for UUID's, so I can > > re-implement the library if required. > > We talked about this on the replication mailing list. We decided that > hostname, properly hashed to an integer, was the proper way to get this > value. Also, there should be a postgresql.conf variable so you can > override the hostname-generated value if you wish. I think that is > sufficient. I will do something like this, but reserve 16 bytes for it just in case we change our minds. It needs to be different among systems on the same machine, so there needs to be a time value and a pseudo-random part as well. Also, 'hostname' will likely be the same on many machines (localhost.localdomain or similar). The only reason I bothered with UUID's before is because they have a standard setup to make the possibility of collision extremely small, and I figured replication will end up using it someday. > > > We also haven't discussed commands for backup/restore, but I will use > > what I think is appropriate and we can change the grammar if needed. The > > initial hot-backup capability will require the database to be in > > read-only mode and use tar for backup, and I will add the ability to > > allow writes later. > > Yea, I saw Tom balked at that. I think we have enough manpower and time > that we can get hot backup in normal read/write mode working before 7.3 > beta so I would just code it assuming the system is live and we can deal > with making it hot-capable once it is in CVS. It doesn't have to work > 100% until beta time. Hot backup read/write requires that we force an advance in the logfile segment after the backup. We need to save all the logs between backup start and completion. Otherwise the files will be useless as a standalone system if the current logs somehow get destroyed (fire in the machine room, etc.). The way I would do this is: create a checkpoint do the block-by-block walk of the files using the bufmgr create a second checkpoint force the log to advance past the end of the current segment save the log segments containing records between the first & second checkpont with the backup Then if you restore the backup, you can recover to the point of the second checkpoint, even if the logs since then are all gone. Right now the log segment size is fixed, so this means that we'd waste 8MB of log space on average to do a backup. Also, the way XLOG reads records right now, we have to write placeholder records into the empty space, because that's how it finds the end of the log stream. So I need to change XLOG to handle "skip records", and then to truncate the file when it gets archived, so we don't have to save up to 16MB of zeros. Also, if archiving is turned off, then we can't recycle or delete any logs for the duration of the backup, and we have to save them. So I'll finish the XLOG support for this, and then think about the correct way to walk through all the files. -- J. R. Nield [EMAIL PROTECTED] ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/users-lounge/docs/faq.html
Re: [HACKERS] Issues Outstanding for Point In Time Recovery (PITR)
I don't know how our marketing came up third most popular but I think the order is, Oracle, MySQL, and PostgreSQL or maybe Oracle, MSSQL and PostgreSQL. I'm sure there is some criterion by which PostgreSQL is tenth and by some other its number one. Of course, my posting was about Point In Time Recovery and not multera marketing spin. -regards richt -Original Message- From: Bruce Momjian [mailto:[EMAIL PROTECTED]] Sent: Wednesday, July 17, 2002 6:29 PM To: Rod Taylor Cc: PostgreSQL-development; [EMAIL PROTECTED]; J. R. Nield Subject: Re: [HACKERS] Issues Outstanding for Point In Time Recovery (PITR) Rod Taylor wrote: > > server powered by PostgreSQL, the third most popular database, and a > > > >^ > > > Wonder why we are "the third most popular database". I think that's > > good? > > You'll notice they didn't qualify where. On this list, it's probably > #1. Within Progress software perhaps we're third most popular (whatever > two are typically used in the InnerEdge are 1 and 2). Yea, using that logic, #1 would be the Progress internal db system, #2 would be MySQL (though that seems doubtful at this point with Nusphere), and PostgreSQL. Actually, PostgreSQL is #1 in my home. -- Bruce Momjian| http://candle.pha.pa.us [EMAIL PROTECTED] | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup.| Drexel Hill, Pennsylvania 19026 ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Issues Outstanding for Point In Time Recovery (PITR)
On July 17, 2002 05:45 pm, Richard Tucker wrote: > We also have implemented a roll forward recovery mechanism. We modified a > 7.2.1 version of Postgres. > ... Excellent! I can't wait. When will it be in current? > The BIG THING we have not done is address the issue that add/drop tables > and indexes do not propagate through the roll forward recovery mechanism > properly. I can live with that. Schemas shouldn't change so often that you can't just dup any changes to the backup(s). -- D'Arcy J.M. Cain| Democracy is three wolves http://www.druid.net/darcy/| and a sheep voting on +1 416 425 1212 (DoD#0082)(eNTP) | what's for dinner. ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] Issues Outstanding for Point In Time Recovery (PITR)
Rod Taylor wrote: > > server powered by PostgreSQL, the third most popular database, and a > > > >^ > > > Wonder why we are "the third most popular database". I think that's > > good? > > You'll notice they didn't qualify where. On this list, it's probably > #1. Within Progress software perhaps we're third most popular (whatever > two are typically used in the InnerEdge are 1 and 2). Yea, using that logic, #1 would be the Progress internal db system, #2 would be MySQL (though that seems doubtful at this point with Nusphere), and PostgreSQL. Actually, PostgreSQL is #1 in my home. -- Bruce Momjian| http://candle.pha.pa.us [EMAIL PROTECTED] | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup.| Drexel Hill, Pennsylvania 19026 ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] Issues Outstanding for Point In Time Recovery (PITR)
> server powered by PostgreSQL, the third most popular database, and a > >^ > Wonder why we are "the third most popular database". I think that's > good? You'll notice they didn't qualify where. On this list, it's probably #1. Within Progress software perhaps we're third most popular (whatever two are typically used in the InnerEdge are 1 and 2). ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Issues Outstanding for Point In Time Recovery (PITR)
Bruce Momjian wrote: > Richard Tucker wrote: > > > > We also have implemented a roll forward recovery mechanism. We modified a > > 7.2.1 version of Postgres. > > The mechanism is designed to provide a means of recoverying from the loss or > > corruption of media. It provides for duplicating wal_files so that if a > > wal_file is lost roll forward recovery can recover the database using the > > duplicated wal_files. Hooks were also added so that the roll forward > > recovery mechanism can be used to implement a hot standby database. > > Along with the roll forward recovery mechanism we have also implemented an > > online database backup utility which is synchronized with the recovery log > > so that the backup can be a starting point for a roll forward recovery > > session. > > In researching who has done this work, I found that > http://www.multera.com/ is Progress Software. I assume this > "Distributed but Connected" is partly based on PostgreSQL. Oh, I see it now, PostgreSQL is right there: * OuterEdge A distributed site technology suite that allows remote users to maintain the integrity, not the congestion and price, of centralized data. The OuterEdge includes the affordable UltraSQL^(TM) database server powered by PostgreSQL, the third most popular database, and a ^ Secure Application/web server that supports the industry's most popular Internet languages. Together with the Replication Engine software, the OuterEdge gives remote users seamless access to important data without relying on bandwidth. Wonder why we are "the third most popular database". I think that's good? -- Bruce Momjian| http://candle.pha.pa.us [EMAIL PROTECTED] | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup.| Drexel Hill, Pennsylvania 19026 ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/users-lounge/docs/faq.html
Re: [HACKERS] Issues Outstanding for Point In Time Recovery (PITR)
Richard Tucker wrote: > > We also have implemented a roll forward recovery mechanism. We modified a > 7.2.1 version of Postgres. > The mechanism is designed to provide a means of recoverying from the loss or > corruption of media. It provides for duplicating wal_files so that if a > wal_file is lost roll forward recovery can recover the database using the > duplicated wal_files. Hooks were also added so that the roll forward > recovery mechanism can be used to implement a hot standby database. > Along with the roll forward recovery mechanism we have also implemented an > online database backup utility which is synchronized with the recovery log > so that the backup can be a starting point for a roll forward recovery > session. In researching who has done this work, I found that http://www.multera.com/ is Progress Software. I assume this "Distributed but Connected" is partly based on PostgreSQL. -- Bruce Momjian| http://candle.pha.pa.us [EMAIL PROTECTED] | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup.| Drexel Hill, Pennsylvania 19026 ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Issues Outstanding for Point In Time Recovery (PITR)
We also have implemented a roll forward recovery mechanism. We modified a 7.2.1 version of Postgres. The mechanism is designed to provide a means of recoverying from the loss or corruption of media. It provides for duplicating wal_files so that if a wal_file is lost roll forward recovery can recover the database using the duplicated wal_files. Hooks were also added so that the roll forward recovery mechanism can be used to implement a hot standby database. Along with the roll forward recovery mechanism we have also implemented an online database backup utility which is synchronized with the recovery log so that the backup can be a starting point for a roll forward recovery session. Roll forward recovery is enabled for a database cluster by specifying one of two postmaster configuration parameters. wal_file_reuse = false This parameter tells the wal system to not reuse wal files. This option is intended for sites wishing to implement a hot standby database where wal files will be periodically copied to another machine where they will be rolled forward into the standby database. wal_file_duplicate = This parameter tells the system to mirror the files in the pg_xlog directory to the specified directory. This allows for the recovery of a database where a wal file has been damaged or lost. It also allows for a variant of a hot standby database where the duplicate directory is the pg_xlog directory of the standby database. Since both of these options cause wal files to accumulate indefinately the dba needs a means of purging wal files when they are no longer needed. So an sql command, "ALTER SYSTEM PURGE WAL_FILES ", has also been implemented. This command deletes all wal files up to and including the specified as long as those wal files are not needed to recover the database in the event of a system crash. To find out the status of the wal files a function has been implemented to return a wal file name. The function is: Wal_file( ) Request_type := [ current | last | checkpoint | oldest] Wal_file ('current') returns the name of the log file currently being written to. Wal_file('last') returns the name of the last log file filled. Wal_file('checkpoint') returns the name of the file containing the current redo position. The current redo position is the position in the recovery log where crash recovery would start if the system were to crash now. All logs prior to this one will not be needed to recover the database cluster and could be safely removed. Wal_file('oldest') returns the oldest xlog file found in the pg_xlog directory. To actually perform a roll forward you use the postmaster configuration parameter "roll_forward=yes". This parameter tells the startup process to perform crash recovery even though the state of the database as found in the pg_control file indicates a normal shutdown. This is necessary since the starting point of roll forward session could be the restore of a database cluster that was shutdown in order to back it up. Furthermore this parameter tells the startup process not to write out a checkpoint record at the end of the roll forward session. This allows for the database cluster to receive subsequent wal files and to have those rolled forward as well. When starting the postmaster with the roll_forward=yes option, it shuts down the database as soon as the startup process completes. So the idea is to restore a backup, copy all of your saved/duplicated wal files into the pg_xlog directory of the restored database and start the postmaster with the roll_forward option. For point in time recovery there is also a roll_forward_until = which rolls forward through the wal files until the first transaction commit note that is greater than or equal to the specified time. The pg_copy utility performs an on line copy of a database cluster. Its syntax is: pg_copy [-h host] [-p port] ... This makes a copy of the database where backup_directory is what you would set PGDATA to in order start a postmaster against the backup copy. The database can be being updated while the copy occurs. If you start a postmaster against this copy it will appear to the startup process as a database that crashed at the instant the pg_copy operation completed. Futhermore the pg_copy utility automatically removes any wal files not needed to recover the database from either pg_xlog directory or the wal_file_duplicate directory. So a DBA to protect the database from media loss just needs to set the wal_file_duplicate paramater and periodically pg_copy the database. The BIG THING we have not done is address the issue that add/drop tables and indexes do not propagate through the roll forward recovery mechanism properly. -regards Richard Tucker ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/users-lounge/docs/faq.html
Re: [HACKERS] Issues Outstanding for Point In Time Recovery (PITR)
J. R. Nield wrote: > On Tue, 2002-07-16 at 15:36, Bruce Momjian wrote: > > > > J.R., just checking to see how PITR recovery is going. Do you need any > > assistance or have any questions for us? > > > > Also, do you have any idea how close you are to having something > > completed? Are you aware we are closing development of 7.3 at the end > > of August and start beta September 1? Is there any way we can help you? > > > > It should be ready to go into CVS by the end of the month. > > That will include: logging all operations except for rtree and GiST, > archival of logfiles (with new postgresql.conf params), headers on the > logfiles to verify the system that created them, standalone backend > recovery to a point-in-time, and a rudimentary hot backup capability. Sounds great. That gives us another month to iron out any remaining issues. This will be a great 7.3 feature! > I could use some advice on the proper way to add tests to configure.in, > given that the autoconf output is in CVS. Would you ever want a patch to > include the generated 'configure' file? We only patch configure.in. If you post to hackers, they can give you assistance and I will try to help however I can. I can so some configure.in stuff for you myself. > Related to that, the other place I need advice is on adding Ted Tso's > LGPL'd UUID library (stolen from e2fsprogs) to the source. Are we > allowed to use this? There is a free OSF/DCE spec for UUID's, so I can > re-implement the library if required. We talked about this on the replication mailing list. We decided that hostname, properly hashed to an integer, was the proper way to get this value. Also, there should be a postgresql.conf variable so you can override the hostname-generated value if you wish. I think that is sufficient. > We also haven't discussed commands for backup/restore, but I will use > what I think is appropriate and we can change the grammar if needed. The > initial hot-backup capability will require the database to be in > read-only mode and use tar for backup, and I will add the ability to > allow writes later. Yea, I saw Tom balked at that. I think we have enough manpower and time that we can get hot backup in normal read/write mode working before 7.3 beta so I would just code it assuming the system is live and we can deal with making it hot-capable once it is in CVS. It doesn't have to work 100% until beta time. > Does this sound like a reasonable timeframe/feature-set to make the 7.3 > release? Sounds great. This is another killer 7.3 feature, and we really need this for greater enterprise acceptance of PostgreSQL. -- Bruce Momjian| http://candle.pha.pa.us [EMAIL PROTECTED] | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup.| Drexel Hill, Pennsylvania 19026 ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])
Re: [HACKERS] Issues Outstanding for Point In Time Recovery (PITR)
"J. R. Nield" <[EMAIL PROTECTED]> writes: > Related to that, the other place I need advice is on adding Ted Tso's > LGPL'd UUID library (stolen from e2fsprogs) to the source. Are we > allowed to use this? Uh, why exactly is UUID essential for this? (The correct answer is "it's not", IMHO.) > We also haven't discussed commands for backup/restore, but I will use > what I think is appropriate and we can change the grammar if needed. The > initial hot-backup capability will require the database to be in > read-only mode and use tar for backup, and I will add the ability to > allow writes later. There is no read-only mode, and I for one will resist adding one. regards, tom lane ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/users-lounge/docs/faq.html
Re: [HACKERS] Issues Outstanding for Point In Time Recovery (PITR)
On Tue, 2002-07-16 at 15:36, Bruce Momjian wrote: > > J.R., just checking to see how PITR recovery is going. Do you need any > assistance or have any questions for us? > > Also, do you have any idea how close you are to having something > completed? Are you aware we are closing development of 7.3 at the end > of August and start beta September 1? Is there any way we can help you? > It should be ready to go into CVS by the end of the month. That will include: logging all operations except for rtree and GiST, archival of logfiles (with new postgresql.conf params), headers on the logfiles to verify the system that created them, standalone backend recovery to a point-in-time, and a rudimentary hot backup capability. I could use some advice on the proper way to add tests to configure.in, given that the autoconf output is in CVS. Would you ever want a patch to include the generated 'configure' file? Related to that, the other place I need advice is on adding Ted Tso's LGPL'd UUID library (stolen from e2fsprogs) to the source. Are we allowed to use this? There is a free OSF/DCE spec for UUID's, so I can re-implement the library if required. We also haven't discussed commands for backup/restore, but I will use what I think is appropriate and we can change the grammar if needed. The initial hot-backup capability will require the database to be in read-only mode and use tar for backup, and I will add the ability to allow writes later. Does this sound like a reasonable timeframe/feature-set to make the 7.3 release? -- J. R. Nield [EMAIL PROTECTED] ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/users-lounge/docs/faq.html
Re: [HACKERS] Issues Outstanding for Point In Time Recovery (PITR)
J.R., just checking to see how PITR recovery is going. Do you need any assistance or have any questions for us? Also, do you have any idea how close you are to having something completed? Are you aware we are closing development of 7.3 at the end of August and start beta September 1? Is there any way we can help you? --- J. R. Nield wrote: > On Fri, 2002-07-05 at 01:42, Bruce Momjian wrote: > > > > We have needed > > point-in-time recovery for a long time, > > Most thanks should go to vadim (and whoever else worked on this), since > his WAL code already does most of the work. The key thing is auditing > the backend to look for every case where we assume some action is not > visible until after commit, and therefore don't log its effects. Those > are the main cases that must be changed. -- Bruce Momjian| http://candle.pha.pa.us [EMAIL PROTECTED] | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup.| Drexel Hill, Pennsylvania 19026 ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])
Re: [HACKERS] Issues Outstanding for Point In Time Recovery (PITR)
On Tue, 2002-07-09 at 17:26, Tom Lane wrote: > Hannu Krosing <[EMAIL PROTECTED]> writes: > > 1) record the lowest uncommitted transaction number (LUTN) , this may > > have problems with wraparound, but I guess they are solvable. Disllow > > VACUUM. Do a CHECKPOINT ('alter database begin backup') > > 3) make a file-level (.tar) backup of data directory. > > 4) Allow VACUUM. ('alter database end backup') > > Transactions don't necessarily commit in sequence number order, so the > concept of LUTN seems meaningless. Not quite. It is the most simple way to be sure that if we invalidate all transactions >= than it we get back to a fairly recent Point-In-Time. The real solution would of course be to remember all committed transactions at this PIT, which can probably be done by remembering LUTN and all individual committed transactions > LUTN > Why is it necessary (or even good) to disallow VACUUM? So that it would be possible to resurrect these tuples that have been deleted/updated during disk-level backup. I would like better the ability to tell VACUUM not to touch tuples where deleting transaction number >= LUTN . IIRC the original postgres was able to do that. > I really dislike > a design that allows the DBA to cripple the database by forgetting the > last step in a (long) process. There are several ways around it. 1. do it in a script, that will not forget. 2. Closing the session that did 'alter database begin backup' session could do it automatically, but this would make the backup script trickier. 3. VACUUM should not block but report a warning about being restricted from running. 4. database can be instructed to send a message to DBA's pager if it has been in 'begin backup' state too long ;) Hannu ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] Issues Outstanding for Point In Time Recovery (PITR)
Hannu Krosing <[EMAIL PROTECTED]> writes: > 1) record the lowest uncommitted transaction number (LUTN) , this may > have problems with wraparound, but I guess they are solvable. Disllow > VACUUM. Do a CHECKPOINT ('alter database begin backup') > 3) make a file-level (.tar) backup of data directory. > 4) Allow VACUUM. ('alter database end backup') Transactions don't necessarily commit in sequence number order, so the concept of LUTN seems meaningless. Why is it necessary (or even good) to disallow VACUUM? I really dislike a design that allows the DBA to cripple the database by forgetting the last step in a (long) process. regards, tom lane ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Issues Outstanding for Point In Time Recovery (PITR)
I know that in Oracle there are 'alter database begin backup' and 'alter database end backup' commands that allow you to script your hot backups through a cron job by calling the begin backup command first, then using disk backup method of choice and then finally call the end backup command. --Barry Patrick Macdonald wrote: >Zeugswetter Andreas SB SD wrote: > > >>>As noted, one of the main problems is knowing where to begin >>>in the log. This can be handled by having backup processing >>>update the control file with the first lsn and log file >>>required. At the time of the backup, this information is or >>>can be made available. The control file can be the last file >>>added to the tar and can contain information spanning the entire >>>backup process. >>> >>> >>lsn and logfile number (of latest checkpoints) is already in the control >>file, thus you need control file at start of backup. (To reduce the number >>of logs needed for restore of an online backup you could force a checkpoint >>before starting file backup) >> >> > >Maybe I should have been more clear. The control file snapshot must >be taken at backup start (as you mention) but can be stored in cache. >The fields can then be modified as we see fit. At the end of backup, >we can write this to a temp file and add it to the tar. Therefore, >as mentioned, the snapshot spans the entire backup process. > > > >>You will also need lsn and logfile number after file backup, to know how much >>log needs to at least be replayed to regain a consistent state. >> >> > >This is a nicety but not a necessity. If you have a backup end log >record, you just have to enforce that the PIT recovery encounters >that particular log record on forward recovery. Once encountered, >you know that you at passed the point of back up end. > >Cheers, >Patrick > > > >---(end of broadcast)--- >TIP 5: Have you checked our extensive FAQ? > >http://www.postgresql.org/users-lounge/docs/faq.html > > > > > ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])
Re: [HACKERS] Issues Outstanding for Point In Time Recovery (PITR)
Zeugswetter Andreas SB SD wrote: > > > OK, so you do a tar backup of a file. While you are doing the tar, > > certain 8k blocks are being modified in the file. There is no way to > > know what blocks are modified as you are doing the tar, and in fact you > > could read partial page writes during the tar. > > No, I think all OS's (Unix and NT at least) guard against this, as long as > the whole 8k block is written in one call. It is only the physical layer (disk) > that is prone to partial writes. Yes, good point. The kernel will present a unified view of the 8k block. Of course, there are still cases where 8k blocks are being changed in front/behind in the tarred file. Will WAL allow us to re-synchronize that file even if part of it has pages from an earlier in time than other pages. Uh, I think so. So maybe we don't need the pre-write images in WAL after all. Can we replay the WAL when some pages in the restored file _have_ the WAL changes and some don't? Maybe the LSN on the pages helps with this. -- Bruce Momjian| http://candle.pha.pa.us [EMAIL PROTECTED] | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup.| Drexel Hill, Pennsylvania 19026 ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Issues Outstanding for Point In Time Recovery (PITR)
Zeugswetter Andreas SB SD wrote: > > > As noted, one of the main problems is knowing where to begin > > in the log. This can be handled by having backup processing > > update the control file with the first lsn and log file > > required. At the time of the backup, this information is or > > can be made available. The control file can be the last file > > added to the tar and can contain information spanning the entire > > backup process. > > lsn and logfile number (of latest checkpoints) is already in the control > file, thus you need control file at start of backup. (To reduce the number > of logs needed for restore of an online backup you could force a checkpoint > before starting file backup) Maybe I should have been more clear. The control file snapshot must be taken at backup start (as you mention) but can be stored in cache. The fields can then be modified as we see fit. At the end of backup, we can write this to a temp file and add it to the tar. Therefore, as mentioned, the snapshot spans the entire backup process. > You will also need lsn and logfile number after file backup, to know how much > log needs to at least be replayed to regain a consistent state. This is a nicety but not a necessity. If you have a backup end log record, you just have to enforce that the PIT recovery encounters that particular log record on forward recovery. Once encountered, you know that you at passed the point of back up end. Cheers, Patrick ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/users-lounge/docs/faq.html
Re: [HACKERS] Issues Outstanding for Point In Time Recovery (PITR)
> As noted, one of the main problems is knowing where to begin > in the log. This can be handled by having backup processing > update the control file with the first lsn and log file > required. At the time of the backup, this information is or > can be made available. The control file can be the last file > added to the tar and can contain information spanning the entire > backup process. lsn and logfile number (of latest checkpoints) is already in the control file, thus you need control file at start of backup. (To reduce the number of logs needed for restore of an online backup you could force a checkpoint before starting file backup) You will also need lsn and logfile number after file backup, to know how much log needs to at least be replayed to regain a consistent state. Andreas ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])
Re: [HACKERS] Issues Outstanding for Point In Time Recovery (PITR)
> OK, so you do a tar backup of a file. While you are doing the tar, > certain 8k blocks are being modified in the file. There is no way to > know what blocks are modified as you are doing the tar, and in fact you > could read partial page writes during the tar. No, I think all OS's (Unix and NT at least) guard against this, as long as the whole 8k block is written in one call. It is only the physical layer (disk) that is prone to partial writes. > any page that was modified while we were backing up is in the WAL. On > restore, we can recover whatever tar saw of the file, knowing that the > WAL page images will recover any page changes made during the tar. Assuming above, I do not think this is necessary. > What I suggest is a way for the backup tar to turn on pre-change page > images while the tar is happening, and turn it off after the tar is > done. Again, I do not think this is necessary. Andreas ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/users-lounge/docs/faq.html
Re: [HACKERS] Issues Outstanding for Point In Time Recovery (PITR)
Patrick Macdonald wrote: > The idea of using the last lsn on the page to detect a partial > write is used by other dbms systems. You already have that > information available so there is no overhead in computing it. > Nothing wrong with CRC though. Agreed. Just thought I would point out that is not guaranteed. Suppose the 8k block is spread over 16 512 sectors in two cylinders. The OS or SCSI tagged queuing could wrote the second part of the page (sectors 9-16) before the first group (1-8). If it writes 9-16, then writes 1-8 but fails in the middle of 1-8, the LSN will match at the front and back of the page, but the page will be partially written. -- Bruce Momjian| http://candle.pha.pa.us [EMAIL PROTECTED] | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup.| Drexel Hill, Pennsylvania 19026 ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] Issues Outstanding for Point In Time Recovery (PITR)
J.R., Nice first draft and a good read. Was going to comment in-line but thought this method would be easier to follow. The comments/suggestions below assume that PIT recovery is being performed at the cluster level with a data backup image created by a tar-like utility. As noted, one of the main problems is knowing where to begin in the log. This can be handled by having backup processing update the control file with the first lsn and log file required. At the time of the backup, this information is or can be made available. The control file can be the last file added to the tar and can contain information spanning the entire backup process. For data consistency, since the backup is being performed on an active cluster, we have to make sure to mark the end of the backup. On restore, to make the cluster consistent, you have to force the user to perform forward recovery passed the point of the backup completion marker in the (archived) log. This can be handled using a backup end log record. The backup end log record would have to contain an identifier unique to this backup. If a user requests to stop PIT recovery before this log record is encountered, consistency is not guaranteed. PIT should either disallow the action or warn of possible / impending doom. The necessary logging for rtee (and others) insertions/deletions can be added to the base code. Not much of a worry but I would expect to encounter other missing log items during testing. The idea of using the last lsn on the page to detect a partial write is used by other dbms systems. You already have that information available so there is no overhead in computing it. Nothing wrong with CRC though. As for the DB_SHUTDOWNED state, this could be handled by having the backup processing update the control file field to DB_PIT_REQUIRED (or some such identifier). After a restore, users would be blocked from connecting to the cluster's databases until a forward recovery passed the backup end log record has completed successfully. At the end of normal crash recovery, the user has to go digging to identify in-flight transactions still in the system and abort them manually. It would be nice if PIT recovery automatically aborted all in-flight transactions at the end. As PostgreSQL heads towards forward recovery functionality, it may be wise to add headers to the log files. As the logs from any cluster are identically named, the header would allow unique identification of the file and contents (cluster name, unique log id, id of the prior log file for chaining purposes, lsn ranges, etc). Most helpful. Just a few notes from the administrative side. PIT recovery should probably offer the user the following actions: . forward recover to end of logs [and stop] Process log files located in the current directory until you read through the last one. Allow the user the option to stop or not, just in case the logs are archived. Send back the timestamp of the last encountered commit log record and the series of log files scanned. . forward recover to PIT [and stop] Similar to that described above but use the commit timestamps to gauge PIT progress. . forward recover query Send back the log series covered and the last commit timestamp encountered. . forward recover stop Stop the current forward recovery session. Undo all in-flight transactions and bring the databases down in a consistent state. No other external user actions should be required. Looking forward to reading draft 2. Cheers, Patrick -- Patrick Macdonald Red Hat Canada "J. R. Nield" wrote: > > Hello: > > I've got the logging system to the point where I can take a shutdown > consistent copy of a system, and play forward through multiple > checkpoints. It seems to handle CREATE TABLE/DROP TABLE/TRUNCATE > properly, and things are moving forward well. Recovery to an arbitrary > point-in-time should be just as easy, but will need some administrative > interface for it. > > At this point, some input would be useful on how I should handle things. > > The most important questions that need answering are in sections 2 & 5, > since they impact the most other parts of the system. They will also > require good documentation for sysadmins. > > Issues Outstanding for Point In Time Recovery (PITR) > > $Date: 2002/07/04 14:23:37 $ > > $Revision: 1.4 $ > > J.R. Nield > > (Enc: ISO 8859-15 Latin-9) > > §0 - Introduction > > This file is where I'm keeping track of all the issues I run into while > trying to get PITR to work properly. Hopefully it will evolve into a > description of how PITR actually works once it is implemented. > > I will also try to add feedback as it comes in. > > The big items so-far are: > §1 - Logging Relation file creation, truncation, and removal > This is mostly done. Can do infinte play-forward from > online logs. > §2 - Partial-
Re: [HACKERS] Issues Outstanding for Point In Time Recovery (PITR)
Bruce Momjian wrote: > You are saying, "How do we know what WAL records go with that backup > snapshot of the file?" OK, lets assume we are shutdown. You can grab > the WAL log info from pg_control using contrib/pg_controldata and that > tells you what WAL logs to roll forward when you need to PIT recover > that backup later. If you store that info in the first file you backup, > you can have that WAL pointer available for later recovery in case you > are restoring from that backup. Is that the issue? > > What seems more complicated is doing the backup while the database is > active, and this may be a requirement for a final PITR solution. Some > think we can grab the WAL pointer at 'tar' start and replay that on the > backup even if the file changes during backup. OK, I think I understand live backups now using tar and PITR. Someone explained this to me months ago but now I understand it. First, a key issue is that PostgreSQL doesn't fiddle with individual items on disk. It reads an 8k block, modifies it, (writes it to WAL if it hasn't been written to that WAL segment before), and writes it to disk. That is key. (Are there cases where don't do this, like pg_controldata?) OK, so you do a tar backup of a file. While you are doing the tar, certain 8k blocks are being modified in the file. There is no way to know what blocks are modified as you are doing the tar, and in fact you could read partial page writes during the tar. One solution would be to read the file using the PostgreSQL page buffer, but even then, getting a stable snapshot of the file would be difficult. Now, we could lock the table and prevent writes while it is being backed up, but there is a better way. We already have pre-change page images in WAL. When we do the backup, any page that was modified while we were backing up is in the WAL. On restore, we can recover whatever tar saw of the file, knowing that the WAL page images will recover any page changes made during the tar. Now, you mentioned we may not want pre-change page images in WAL because, with PITR, we can more easily recover from the WAL rather than having this performance hit for many page writes. What I suggest is a way for the backup tar to turn on pre-change page images while the tar is happening, and turn it off after the tar is done. We already have this TODO item: * Turn off after-change writes if fsync is disabled (?) No sense in doing after-change WAL writes without fsync. We could extend this so those after-changes writes could be turned on an off, allowing fill tar backups and PITR recovery. In fact, for people with reliable hardware, we should already be giving them the option of turning off pre-change writes. We don't have a way of detecting partial page writes, but then again, we can't detect failures with fsync off anyway so it seems to be the same vulnerability. I guess that's why we were going to wrap the effect into the same variable, but for PITR, can see wanting fsync always on and the ability to turn pre-change writes on and off. -- Bruce Momjian| http://candle.pha.pa.us [EMAIL PROTECTED] | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup.| Drexel Hill, Pennsylvania 19026 ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/users-lounge/docs/faq.html
Re: [HACKERS] Issues Outstanding for Point In Time Recovery (PITR)
J. R. Nield wrote: > On Fri, 2002-07-05 at 01:42, Bruce Momjian wrote: > > > > We have needed > > point-in-time recovery for a long time, > > Most thanks should go to vadim (and whoever else worked on this), since > his WAL code already does most of the work. The key thing is auditing > the backend to look for every case where we assume some action is not > visible until after commit, and therefore don't log its effects. Those > are the main cases that must be changed. Yep. Glad you can focus on that. > > --- > > > > J. R. Nield wrote: > > > Hello: > > > > > > I've got the logging system to the point where I can take a shutdown > > > consistent copy of a system, and play forward through multiple > > > checkpoints. It seems to handle CREATE TABLE/DROP TABLE/TRUNCATE > > But notably not for the btree indexes! It looked like they were working, > because the files were there, and all indexes created before the backup > would work under insert/delete (including sys catalog indexes). This is > because btree insert/delete is logged, just not during build. So I > missed that one case. > > You will end-up with up-to-date table data though, so it is something. > > Adding logging support to btbuild is the next step, and I don't think it > should be too hard. I am working this now. Great. > It is also a major advantage that most everything in the system gets > stored in the catalog tables, and so is logged already. > > > > Uh, we do log pre-page writes to WAL to recover from partial page > > writes to disk. Is there something more we need here? > > > > As for bad block detection, we have thought about adding a CRC to each > > page header, or at least making it optional. WAL already has a CRC. > > > > Yes this should be last to do, because it is not necessary for PITR, > only for performance (the option not to write pre-images without fear of > data loss). Yep. > > Yes, there are a few places where we actually create a file, and if the > > server crashes, the file remains out there forever. We need to track that > > better. > > OK, there is a bigger problem then just tracking the file though. We > sometimes do stuff to that file that we don't log. We assume that if we > commit, the file must be OK and will not need replay because the > transaction would not have committed if the file was not in a commitable > state. If we abort, the system never sees the file, so in a sense we > undo everything we did to the file. It is a kind of poor-man's rollback > for certain operations, like btbuild, create table, etc. But it means > that we cannot recover the file from the log, even after a commit. Yep. > > > > > > > > ?1.1.1 - CREATE DATABASE is also unlogged > > > > > > This will cause the same replay problems as above. > > > > Yep. Again, seems a master cleanup on startup is needed. > > The cleanup is not the problem, only a nuisance. Creating the files > during replay is the problem. I must recreate CREATE DATABASE from the > log exactly as it was done originally. I think just logging the > parameters to the command function should be sufficient, but I need to > think more about it. OK, makes sense. Nice when you can bundle a complex action into the logging of one command and its parameters. > > > > > b) If TRUNCATE TABLE fails, the system must PANIC. Otherwise, the table > > > may be used in a future command, and a replay-recovered database may > > > end-up with different data than the original. > > > > We number based on oids. You mean oid wraparound could cause the file > > to be used again? > > That's not what I meant. Let's say I issue 'TRUNCATE TABLE foo'. Then, > right before smgrtruncate is called, I do an XLogInsert saying "Redo a > TRUNCATE TABLE on foo to nblocks if we crash". Then smgrtruncate fails > and we do an elog(ERROR) > > Now the user decides that since TRUNCATE TABLE didn't work, he might as > well use the table, so he inserts some records into it, generating log > entries. > > When I replay this log sequence later, what happens if the TRUNCATE > succeeds instead of failing? You mean the user is now accessing a partially truncated table? That's just too weird. I don't see how the WAL would know how far truncation had gone. I see why you would need the panic and it seems acceptable. > I admit that there are other ways of handling it than to PANIC if the > truncate fails. All the ones I can come up with seem to amount to some > kind of ad-hoc UNDO log. Yea, truncate failure seems so rare/impossible to happen, we can do a panic and see if it ever happens to anyone. I bet it will not. Those are usually cases of an OS crash, so it is the same as a panic. > > > WAL must be flushed before truncate as well. > > > > > > WAL does not need to be flushed before create, if we don't mind > > > leaking files sometimes. > > > > Cleanup? > > Yes, we could garbage
Re: [HACKERS] Issues Outstanding for Point In Time Recovery (PITR)
> Let me re-write it, and I'll post it in the next version. The section > dealt with what to do when you have a valid restored controlfile from a > backup system, which is in the DB_SHUTDOWNED state, and that points to a > valid shutdown/checkpoint record in the log; only the checkpoint record > happens not to be the last one in the log. This is a situation that > could never happen now, but would in PITR. But it would need to be restore's responsibility to set the flag to DB_IN_PRODUCTION, no? > Even if we shutdown before we copy the file, we don't want a file that > hasn't been written to in 5 weeks before it was backed up to require > five weeks of old log files to recover. So we need to track that > information somehow, because right now if we scanned the blocks in the > file looking for at the page LSN's, we greatest LSN we would see might > be much older than where it would be safe to recover from. That is the > biggest problem, I think. Well, if you skip a validity test it could be restore's responsibility to know which checkpoint was last before the file backup was taken. (When doing a backup you would need to include the last checkpoint info == pg_control at start of backup) Andreas ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] Issues Outstanding for Point In Time Recovery (PITR)
On Thu, 2002-07-04 at 11:45, J. R. Nield wrote: One other item that should be here: > The big items so-far are: > §1 - Logging Relation file creation, truncation, and removal > This is mostly done. Can do infinte play-forward from > online logs. > §2 - Partial-Write and Bad Block detection > Need input before starting. Migration issues. > §3 - Detecting Shutdown Consistent System Recovery > Mostly done. > §4 - Interactive Play-Forward Recovery for an Entire System > Need input before starting. > §5 - Individual file consistent recovery > Need input. Semi-Major changes required. §6 - btbuild is not logged Not logged because of same assumptions as for file create. Only need to log the index build parameters to recreate the index, not each page change. -- J. R. Nield [EMAIL PROTECTED] ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
[HACKERS] Issues Outstanding for Point In Time Recovery (PITR)
Hello: I've got the logging system to the point where I can take a shutdown consistent copy of a system, and play forward through multiple checkpoints. It seems to handle CREATE TABLE/DROP TABLE/TRUNCATE properly, and things are moving forward well. Recovery to an arbitrary point-in-time should be just as easy, but will need some administrative interface for it. At this point, some input would be useful on how I should handle things. The most important questions that need answering are in sections 2 & 5, since they impact the most other parts of the system. They will also require good documentation for sysadmins. Issues Outstanding for Point In Time Recovery (PITR) $Date: 2002/07/04 14:23:37 $ $Revision: 1.4 $ J.R. Nield (Enc: ISO 8859-15 Latin-9) §0 - Introduction This file is where I'm keeping track of all the issues I run into while trying to get PITR to work properly. Hopefully it will evolve into a description of how PITR actually works once it is implemented. I will also try to add feedback as it comes in. The big items so-far are: §1 - Logging Relation file creation, truncation, and removal This is mostly done. Can do infinte play-forward from online logs. §2 - Partial-Write and Bad Block detection Need input before starting. Migration issues. §3 - Detecting Shutdown Consistent System Recovery Mostly done. §4 - Interactive Play-Forward Recovery for an Entire System Need input before starting. §5 - Individual file consistent recovery Need input. Semi-Major changes required. §1 - Logging Relation file creation, truncation, and removal §1.1 - Problem: Without file creation in the log, we can't replay committed transactions that create relations. The current code assumes that any transaction reaching commit has already ensured it's files exist, and that those files will never be removed. This is true now, but not for log-replay from an old backup database system. The current XLOG code silently ignores block-write requests for non-existent files, and assumes that the transaction generating those requests must have aborted. Right now a crash during TRUNCATE TABLE will leave the table in an inconsistent state (partially truncated). This would not work when doing replay from before the last checkpoint. §1.1.1 - CREATE DATABASE is also unlogged This will cause the same replay problems as above. §1.2 - Proposal: a) Augment the SMGR code to log relation file operations, and to handle redo requests properly. This is simple in the case of create. Drop must be logged only IN the commit record. For truncate see (b). The 'struct f_smgr' needs new operations 'smgr_recreate', 'smgr_reunlink', and 'smgr_retruncate'. smgr_recreate should accept a RelFileNode instead of a Relation. Transactions that abort through system failure (ie. unlogged aborts) will simply continue to leak files. b) If TRUNCATE TABLE fails, the system must PANIC. Otherwise, the table may be used in a future command, and a replay-recovered database may end-up with different data than the original. WAL must be flushed before truncate as well. WAL does not need to be flushed before create, if we don't mind leaking files sometimes. c) Redo code should treat writes to non-existent files as an error. Changes affect heap & nbtree AM's. [Check others] d) rtree [and GiST? WTF is GiST? ] is not logged. A replay recovery of a database should mark all the rtree indices as corrupt. [ actually we should do that now, are we? ] e) CREATE DATABASE must be logged properly, not use system(cp...) §1.3 - Status: All logged SMGR operations are now in a START_CRIT_SECTION()/ END_CRIT_SECTION() pair enclosing the XLogInsert() and the underlying fs operations. Code has been added to smgr and xact modules to log: create (no XLogFlush) truncate (XLogFlush) pending deletes on commit record files to delete on abort record Code added to md.c to support redo ops Code added to smgr for RMGR redo/desc callbacks Code added to xact RMGR callbacks for redo/desc Database will do infinite shutdown consistent system recovery from the online logs, if you manually munge the control file to set state == DB_IN_PRODUCTION instead of DB_SHUTDOWNED. Still need to do: Item (c), recovery cleanup in all AM's Item (d), logging in other index AM's Item (e), CREATE DATABASE stuff §2 - Partial-Write and Bad Block detection §2.1 - Problem: In order to protect against partial writes without logging pages twice, we need to detect partial pages in system files and report them to the system administrator. We also might want to be able to det