Re: [HACKERS] Why we really need timelines *now* in PITR

2004-07-21 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 More verbosely (not numbered because they're not a sequence or
 progression)

 - if no recovery.conf is present we do crash recovery to end of logs on
 pg_control timeline (i.e. current) 

Check.

 - if recovery.conf is present and we do not specify a target we do
 archive recovery to end of logs on pg_control timeline (i.e. current) 

I have done it this way for now, but I'm unconvinced whether this is the
best default --- it might be that we'd be better off making 'latest' be
the default.  The point here is that when you restore a tar backup,
'current' is going to become the timeline that was current when the
backup was made, not the one that was current just before you wiped
$PGDATA.  I'm not really sure which case is going to be more commonly
wanted.

 - if recovery.conf is present and we specify a target, but no timeline,
 then we do archive recovery on the pg_control timeline only, and assume
 that the target was supposed to be on this, even if we don't find it

Whether you specify a target stopping point does not matter AFAICS.  The
timeline selection has to be made before we can even look at the data.

 - if recovery.conf is present and we specify a timeline of literally
 'latest' (without having to know what that is) - then we search archive
 for the latest history file, then we do archive recovery from the
 pg_control timeline to the latest timeline as specified in the latest
 history file. If we specify a target, then this is searched for on
 whatever timeline we're on as we rollforward.

Check.

 - if recovery.conf is present and we specify a timeline - then we search
 archive for that history file, then we do archive recovery from the
 pg_control timeline to the specified timeline as shown in that history
 file. If we specify a target, then this is searched for on whatever
 timeline we're on as we rollforward.

Check.

 I don't like the name target_in_timeline,
 
 Agreed, but I don't have a better name offhand for it.

For lack of any better idea, I have swallowed my objections to target
and called it recovery_target_timeline.  We can easily rename the
parameter if anyone comes up with something more compelling.

Above behavior is all committed to CVS as of a few minutes ago.

 Another thing I note is that archive_status .ready messages are written
 for all restored xlog files (rather than .done messages).

I think this is gone now.  However, we still have the issue of preventing
re-archival of old, incomplete XLOG segments that might be brought back
into pg_xlog/ as a result of restoring a tar backup.  I don't have a
better solution to that than the one Bruce and I proposed yesterday
(make the DBA clean out pg_xlog before starting a recovery run).

regards, tom lane

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [HACKERS] Why we really need timelines *now* in PITR

2004-07-21 Thread Simon Riggs
On Wed, 2004-07-21 at 23:42, Tom Lane wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
  More verbosely (not numbered because they're not a sequence or
  progression)
 
  - if no recovery.conf is present we do crash recovery to end of logs on
  pg_control timeline (i.e. current) 
 
 Check.
 
  - if recovery.conf is present and we do not specify a target we do
  archive recovery to end of logs on pg_control timeline (i.e. current) 
 
 I have done it this way for now, but I'm unconvinced whether this is the
 best default --- it might be that we'd be better off making 'latest' be
 the default.  The point here is that when you restore a tar backup,
 'current' is going to become the timeline that was current when the
 backup was made, not the one that was current just before you wiped
 $PGDATA.  I'm not really sure which case is going to be more commonly
 wanted.

Right now, that sounds the best option. But my head hurts :)

 
  - if recovery.conf is present and we specify a target, but no timeline,
  then we do archive recovery on the pg_control timeline only, and assume
  that the target was supposed to be on this, even if we don't find it
 
 Whether you specify a target stopping point does not matter AFAICS.  The
 timeline selection has to be made before we can even look at the data.
 

Yes, I was describing a case where a default behaviour would be required
to make the timeline selection before the desired behaviour could be
enacted. 

  - if recovery.conf is present and we specify a timeline of literally
  'latest' (without having to know what that is) - then we search archive
  for the latest history file, then we do archive recovery from the
  pg_control timeline to the latest timeline as specified in the latest
  history file. If we specify a target, then this is searched for on
  whatever timeline we're on as we rollforward.
 
 Check.
 
  - if recovery.conf is present and we specify a timeline - then we search
  archive for that history file, then we do archive recovery from the
  pg_control timeline to the specified timeline as shown in that history
  file. If we specify a target, then this is searched for on whatever
  timeline we're on as we rollforward.
 
 Check.
 
  I don't like the name target_in_timeline,
  
  Agreed, but I don't have a better name offhand for it.
 
 For lack of any better idea, I have swallowed my objections to target
 and called it recovery_target_timeline.  We can easily rename the
 parameter if anyone comes up with something more compelling.
 
 Above behavior is all committed to CVS as of a few minutes ago.
 

...very cool.

OK, back to first principles as a cross-check then:

PITR should cope with these scenarios. These are described reasonably
closely but not as exact mechanical tests, so as to ensure that if
multiple solutions exist to these recovery scenarios that all paths are
tested.

These are written with a view to *rough* functionality of timelines,
rather than reading the above and making up cases to fit. I suggest we
see if these all work, see why not (if not) and make up some other cases
to make sure all possibilities are catered for.

1. We crash, and wish to recover, as per 7.4

2. We are running happily, using an automated standby database. The
first database fails irrecoverably and we are forced to switch to the
second system which recovers quickly to end of logs, though without the
partially full current xlog from the downed system.

3. We are running happily, but spot a rogue transaction that we wish to
expunge. We decide to run a PITR up to that txnid. We do an archive
recovery to a recovery_target_xid. We have available to us local copies
of the xlogs if required.

4. We perform (3), then after operating for an hour, we realise that
this was an extremely bad idea and decide to recover back to the point
BEFORE we started to recover the first time - i.e. to try to pretend we
had never attempted PITR in the first place because there was some even
more important data just recently committed we didn't know about.

5. We attempt (4) but fail because the then-current log, which has not
been archived, was deleted because we wouldn't need it anymore. We
decide that we made the right choice in the first place and decide to
re-run the PITR, though to a point slightly ahead of where we stopped
last time we tried that.

6. We are running a distributed system that does not properly support
two-phase commit in all of its persistent components. One of the other
components fails (of course not pg!) and we are forced to do a PITR to a
point in time that matches the best last known timestamp of all
persistent system components. We PITR to a recovery_target_time.

7. We have just done (6), but 10 minutes into production we realise that
the clocks between 2 of our systems were out by 3 seconds. Not much, but
it is causing serious errors to bang around the system. We decide to
re-run the previous PITR, but this time to a point 3 seconds further
along the same chain of xlogs. We 

Re: [HACKERS] Why we really need timelines *now* in PITR

2004-07-21 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 PITR should cope with these scenarios. These are described reasonably
 closely but not as exact mechanical tests, so as to ensure that if
 multiple solutions exist to these recovery scenarios that all paths are
 tested.

 [ snip ... ]

Now *my* head is hurting ;-)

AFAICS, we should be able to handle all of these cases, as long as
(a) the DBA doesn't manually delete any xlog files (else he probably
loses the chance to recover to the associated timeline), and
(b) the DBA doesn't screw up on picking which timeline to recover to.

I'd guess that (b) is likely to be the bigger threat :-(.

regards, tom lane

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [HACKERS] Why we really need timelines *now* in PITR

2004-07-21 Thread Christopher Kings-Lynne
That gives us enough to talk through and begin some testing.
Anybody have any other horror stories, bring 'em on.
I think that the PITR docs will have to be written in two sections.  One 
will need to be a pure reference that orthogonally describes the 
options, etc.  The other section will need to be a scenario-based 
explanation of what to do/how to recover in all the major different 
failure patterns.  It's the only way people (I!) will understand it all.

From my point of view, what I need PITR to be able to do is allow me to 
restore to any point in the 24 hour period between pg_dumpalls.  I also 
need to know what the exact criteria for deleting archived logs every 24 
hours, and how that can be determined automatically in a script 
(checking the pg_dumpall end-of-log marker exists as well).  I need to 
be told to copy, not move the logs.  Also, I need to be sure that 
pg_dumpall is enough, and I don't need to make sure I issue a checkpoint 
before the pg_dumpall or anything.

Chris
---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [HACKERS] Why we really need timelines *now* in PITR

2004-07-20 Thread Greg Stark

Tom Lane [EMAIL PROTECTED] writes:

 Anybody see any holes in this design?

God help the DBA who deletes a history file with needed information. Or edits
it inappropriately.

Why can't every log file contain a header that says which timeline it's part
of and which timeline the preceding log file was part of? That would avoid
having a file outside PGDATA and would mean the backups would always contain
enough information to be self-sufficient.

Then if you want to restore from a cold backup and apply PITR up to segment
0020 in timeline X. You read the header of X.0020, find out it followed X.0019
and so on. If X.0010 branched from Y.0009 you'll find out and be able to
continue threading back until you find a segment that matches the current
segment in the cold backup.

The only problem I see is that if your backups are stored on tape it might be
awkward to have to read the headers of all log segments in reverse order to
backtrack to the right place to start.

-- 
greg


---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [HACKERS] Why we really need timelines *now* in PITR

2004-07-20 Thread Simon Riggs
On Tue, 2004-07-20 at 00:58, Tom Lane wrote: 
 Simon Riggs [EMAIL PROTECTED] writes:
  where the default is notarget and if you specify a target, the default
  target_in_timeline is latest.
 
 I think actually the default target has to be the timeline ID found in
 pg_control -- otherwise you get weird behavior in the plain crash
 recovery, non-PITR case.

Yes, I was talking about recovery.conf only

So, overall there would be 5 recovery modes with 4 levels of default:

Summarised in the table, also showing what set of actions/parameters we
need to specify to invoke that mode...

SPECIFIED ITEMs
MODErecovery.conf   rec_target* target_in_timeline
crash recovery
- up to 7.4-no--no--no
archive recovery 
- current end---yes-no--no
- current target (*)yes-yes-no
  current timeline
- other target (*)--yes-either--yes, = 'latest'
  latest timeline
- other target (*)--yes-either--yes, = 'value'
  other timeline

(*) these operations cause a new timeline to be created

More verbosely (not numbered because they're not a sequence or
progression)

- if no recovery.conf is present we do crash recovery to end of logs on
pg_control timeline (i.e. current) 

- if recovery.conf is present and we do not specify a target we do
archive recovery to end of logs on pg_control timeline (i.e. current) 

- if recovery.conf is present and we specify a target, but no timeline,
then we do archive recovery on the pg_control timeline only, and assume
that the target was supposed to be on this, even if we don't find it

- if recovery.conf is present and we specify a timeline of literally
'latest' (without having to know what that is) - then we search archive
for the latest history file, then we do archive recovery from the
pg_control timeline to the latest timeline as specified in the latest
history file. If we specify a target, then this is searched for on
whatever timeline we're on as we rollforward.

- if recovery.conf is present and we specify a timeline - then we search
archive for that history file, then we do archive recovery from the
pg_control timeline to the specified timeline as shown in that history
file. If we specify a target, then this is searched for on whatever
timeline we're on as we rollforward.

  I don't like the name target_in_timeline,
 
 Agreed, but I don't have a better name offhand for it.  The point I was
 making is that we seem to be using target to mean a point-in-time
 stopping target.  But you might be interested in going to the end of
 timeline N and so there's not a target in that sense.  That's why I
 was wanting to avoid using the term target for the desired timeline.
 But maybe there's not a better word.

how about?

end_at_timeline

which is more neutral towards the idea of whether a target has been
specified or not...


Another thing I note is that archive_status .ready messages are written
for all restored xlog files (rather than .done messages). That seems to
cause the archive to be completely overwritten when recovery is
complete. Is that part of the plan for timelines also. Not sure I
understand that...


Best Regards, Simon Riggs


---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [HACKERS] Why we really need timelines *now* in PITR

2004-07-19 Thread Simon Riggs
On Mon, 2004-07-19 at 04:31, Tom Lane wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
  The way you write this makes me think you might mean you would allow: we
  can start recovering in one timelines, then rollforward takes us through
  all the timeline nexus points required to get us to the target
  timeline.
 
 Sure.  Let's draw a diagram:
 
   0001.0014 - 0001.0015 - 0001.0016 - 0001.0017 - ...
 |
 + 0002.0016 - 0002.0017 - ...
 |
 + 0003.0017 - ...
 
 If you decide you would like to recover to someplace in timeline 0002,
 you need to take the 0002 log files where they exist, and the 0001
 log files where there is no 0002, except you do not revert to 0001
 once you have used an 0002 file (this restriction is needed in case
 the 0001 timeline goes to higher segment numbers than 0002 has reached).
 In no case do you use an 0003 file.
 
  I had imagined that recovery would only ever be allowed to start and end
  on the same timeline. I think you probably mean that?
 
 Logically it's all one timeline, I suppose, but to implement it
 physically that way would mean duplicating all past 0001 segments when
 we want to create the 0002 timeline.  That's not practical and not
 necessary.
 
  Another of the issues I was thinking through was what happens at the end
  of your scenario abobe
  - You're on timeline 1 and you need to perform recovery.
  - You perform recovery and timeline 2 is created.
  - You discover another error and decide to recover again.
  - You recover timeline 1 again: what do you name the new timeline
  created? 2 or 3?
 
 You really want to call it 3.  To enforce this mechanically would
 require having a counter that sits outside the $PGDATA directory and
 is not subject to being reverted by a restore-from-backup.  I don't
 see any very clean way to do that at the moment --- any thoughts?
 
 In the absence of such a counter we could ask the DBA to specify a new
 timeline number in recovery.conf, but this strikes me as one of those
 easy-to-get-wrong things ...
 
 One possibility is to extend the archiving API so that we can inquire
 about the largest timeline number that exists anywhere in the archive.
 If we take new timeline number = 1 + max(any in archive, any in pg_xlog)
 then we are safe.  But I'm not really convinced that such a thing would
 be any less error-prone than the manual way :-(, because for any
 archival method that's more complicated than cp them all into one
 directory, it'd be hard to extract the max stored filename.
 

Think the same as you do on all of that. Excellent.

Some further thinking from that base...

Perhaps timelines should be nest-numbered: (using 0 as a counter also)
0 etc is the original branch
0.1 is the first recovery off the original branch
0.2 is the second recovery off the original branch
0.1.1 is the first recovery off the first recovery (so to speak)
0.1.2 is the second etc
That way you don't have the problem of which is 3? in the examples
above. [Would we number a recovery of 1 as 3 or would then next recovery
off 2 be numbered 3?]

Not necessarily the way we would show that as a timeline number. It
could still be shown as a single hex number representing each nesting
level as 4 bits...(restricting us to 7 recoveries per timeline...)

Just as a thought, DB2 uses the concept of a history file also

If we go with the renaming recovery.conf when it completes, why not make
that the record of previous recoveries? Move it to archive_status and
name it according to the timeline it just created, e.g.
recovery.done.timeline.timestamp

If you're re-restoring into the same directory you wouldn't then
overwrite the history of previous recoveries.

It would be an extremely bad thing to have to specify the timeline
number, and I agree we can't ask the archive. 

Best Regards, Simon Riggs


---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Why we really need timelines *now* in PITR

2004-07-19 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 Some further thinking from that base...

 Perhaps timelines should be nest-numbered: (using 0 as a counter also)
 0 etc is the original branch
 0.1 is the first recovery off the original branch
 0.2 is the second recovery off the original branch
 0.1.1 is the first recovery off the first recovery (so to speak)
 0.1.2 is the second etc
 That way you don't have the problem of which is 3? in the examples
 above. [Would we number a recovery of 1 as 3 or would then next recovery
 off 2 be numbered 3?]

Hmm.  This would have some usefulness as far as documenting how did we
get here, but unless you knew where/when the timeline splits had
occurred, I don't think it would be super useful.  It'd make more sense
to record the parentage and split time of each timeline in some
human-readable meta-history reference file (but where exactly?)

I don't think it does anything to solve our immediate problem, anyhow.
You may know that you are recovering off of branch 0.1, but how do you
know if this is the first, second, or Nth time you have done that?

 Not necessarily the way we would show that as a timeline number. It
 could still be shown as a single hex number representing each nesting
 level as 4 bits...(restricting us to 7 recoveries per timeline...)

Sounds too tight to me :-(

I do see a hole in my original concept now that you mention it.  It'd be
quite possible for timeline 2 *not* to be an ancestor of timeline 3,
that is you might have tried a restore, not liked the result, and
decided to re-restore from someplace else on timeline 1.  That is,
instead of

0001.0014 - 0001.0015 - 0001.0016 - 0001.0017 - ...
  |
  + 0002.0016 - 0002.0017 - ...
  |
  + 0003.0017 - ...

maybe the history is

0001.0014 - 0001.0015 - 0001.0016 - 0001.0017 - ...
  |   |
  |   + 0003.0017 - ...
  |
  + 0002.0016 - 0002.0017 - ...

where I've had to draw 3 above 2 to avoid having unrelated lines
crossing each other in my diagram.  The problem here is that a crash
recovery in timeline 3 would not realize that it should not use WAL
segment 0002.0016.  So we need a more sophisticated rule than just
numerical comparison of timeline numbers.

I think your idea of nested numbers might fix this, but I'm concerned
about the restrictions of fitting it into 32 bits as you suggest.
Can we think of a less restrictive representation?

 If we go with the renaming recovery.conf when it completes, why not make
 that the record of previous recoveries? Move it to archive_status and
 name it according to the timeline it just created, e.g.
 recovery.done.timeline.timestamp

There's still the problem of how can you be sure that all the files
created in the past are still in there.  It'd be way too likely for
someone to decide they ought to do a recovery restore by first doing
rm -rf $PGDATA.  Or they lost the disk entirely and are restoring
their last full backup onto virgin media.

I think there's really no way around the issue: somehow we've got to
keep some meta-history outside the $PGDATA area, if we want to do this
in a clean fashion.  We could perhaps expect the archive area to store
it, but I'm a bit worried about the idea of overwriting the meta-history
file in archive from time to time; it's mighty critical data and you'd
not be happy if a crash corrupted your only copy.  We could archive
meta-history files with successively higher versioned names ... but then
we need an API extension to get back the latest one.

regards, tom lane

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [HACKERS] Why we really need timelines *now* in PITR

2004-07-19 Thread Tom Lane
I wrote:
 I think there's really no way around the issue: somehow we've got to
 keep some meta-history outside the $PGDATA area, if we want to do this
 in a clean fashion.

After further thought I think we can fix this stuff by creating a
history file for each timeline.  This will make recovery slightly more
complicated but I don't think it would be any material performance
problem.  Here's how it goes:

* Timeline IDs are 32-bit ints with no particular semantic significance
(that is, we do not assume timeline 3 is a child of 2, or anything like
that).  The actual parentage of a timeline has to be found by inspecting
its history file.

* History files will be named by their timeline ID, say 0042.history.
They will be created in /pg_xlog whenever a new timeline is created
by the act of doing a recovery to a point in time earlier than the end
of existing WAL.  When doing WAL archiving a history file can be copied
off to the archive area by the existing archiver mechanism (ie, we'll
make a .ready file for it as soon as it's written).

* History files will be plain text (for human consumption) and will
essentially consist of a list of parent timeline IDs in sequence.
I envision adding the timeline split timestamp and starting WAL segment
number too, but these are for documentation purposes --- the system
doesn't need them.  We may as well allow comments in there as well,
so that the DBA can annotate the reasons for a PITR split to have been
done.  So the contents might look like

# Recover from unintentional TRUNCATE
0001000A001425682005-05-16 12:34:15 EDT
# Ex-assistant DBA dropped wrong table
0007002254342005-11-17 18:44:44 EST

When we split off a new timeline, we just have to copy the parent's
history file (which we can do verbatim including comments) and then
add a new line at the end showing the immediate parent's timeline ID
and the other details of the split.  Initdb can create 0001.history
with empty contents (since that timeline has no parents).

* When we need to do recovery, we first identify the source timeline
(either by reading the current timeline ID from pg_control, or the DBA
can tell us with a parameter in recovery.conf).  We then read the
history file for that timeline, and remember its sequence of parent
timeline IDs.  We can crosscheck that pg_control's timeline ID is
one of this set of timeline IDs, too --- if it's not then the wrong
backup file was restored.

* During recovery, whenever we need to open a WAL segment file, we first
try to open it with the source timeline ID; if that doesn't exist, try
the immediate parent timeline ID; then the grandparent, etc.  Whenever
we find a WAL file with a particular timeline ID, we forget about all
parents further up in the history, and won't try to open their segments
anymore (this is the generalization of my previous rule that you never
drop down in timeline number as you scan forward).

* If we end recovery because we have rolled forward off the end of WAL,
we can just continue using the source timeline ID --- we are extending
that timeline.  (Thus, an ordinary crash and restart doesn't require
generating a new timeline ID; nor do we generate a new line during
normal postmaster stop/start.)  But if we stop recovery at a requested
point-in-time earlier than end of WAL, we have to branch off a new
timeline.  We do this by:
* Selecting a previously unused timeline ID (see below).
* Writing a history file for this ID, by copying the parent
  timeline's history file and adding a new line at the end.
* Copying the last-used WAL segment of the parent timeline,
  giving it the same segment number but the new timeline's ID.
  This becomes the active WAL segment when we start operating.

* We can identify the highest timeline ID ever used by simply starting
with the source timeline ID and probing pg_xlog and the archive area
for history files N+1.history, N+2.history, etc until we find an ID
for which there is no history file.  Under reasonable scenarios this
will not take very many probes, so it doesn't seem that we need any
addition to the archiver API to make it more efficient.

* Since history files will be small and made infrequently (one hopes you
do not need to do a PITR recovery very often...) I see no particular
reason not to leave them in /pg_xlog indefinitely.  The DBA can clean
out old ones if she is a neatnik, but I don't think the system needs to
or should delete them.  Similarly the archive area could be expected to
retain history files indefinitely.

* However, you *can* throw away a history file once you are no longer
interested in rolling back to times predating the splitoff point of the
timeline.  If we don't find a history file we can just act as though the
timeline has no parents (extends indefinitely far in the past).  (Hm,
so we don't actually have to bother creating 0001.history...)

* 

Re: [HACKERS] Why we really need timelines *now* in PITR

2004-07-19 Thread Simon Riggs
On Mon, 2004-07-19 at 16:58, Tom Lane wrote:
 I think there's really no way around the issue: somehow we've got to
 keep some meta-history outside the $PGDATA area, if we want to do this
 in a clean fashion.  We could perhaps expect the archive area to store
 it, but I'm a bit worried about the idea of overwriting the meta-history
 file in archive from time to time; it's mighty critical data and you'd
 not be happy if a crash corrupted your only copy.  We could archive
 meta-history files with successively higher versioned names ... but then
 we need an API extension to get back the latest one.
 

Yes, you've convinced me.

It is critical data, but never for that long. If we only split timelines
when we recover, then we just make not to take about ~100 copies of it
immediately. If we really did recover OK, then it'll only be a few
days/weeks before we can forget it ever happened.

The crucial time is when re-running recoveries repeatedly and if we
write the manual with sufficient red ink then we'll avoid this. But
heck, not having your history file is only as bad as not having added
timelines in the first place. Not great, just more care required.

Best Regards, Simon Riggs


---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] Why we really need timelines *now* in PITR

2004-07-19 Thread Simon Riggs
On Mon, 2004-07-19 at 19:33, Tom Lane wrote:
 I wrote:
  I think there's really no way around the issue: somehow we've got to
  keep some meta-history outside the $PGDATA area, if we want to do this
  in a clean fashion.
 
 After further thought I think we can fix this stuff by creating a
 history file for each timeline.  This will make recovery slightly more
 complicated but I don't think it would be any material performance
 problem.  Here's how it goes:

Yes...I came to the conclusion that trying to avoid doing something like
DB2 does was just stubornness on my part. We may as well use analogies
with other systems when they are available.

All of this is good. Two main areas of comments/questions, noted (**)

Timelines should be easy to understand for anybody that can follow a
HACKERS conversation anyhow :)

 
 * Timeline IDs are 32-bit ints with no particular semantic significance
 (that is, we do not assume timeline 3 is a child of 2, or anything like
 that).  The actual parentage of a timeline has to be found by inspecting
 its history file.
 

OK...thats better. The nested idea doesn't read well second time
through.

 * History files will be named by their timeline ID, say 0042.history.
 They will be created in /pg_xlog whenever a new timeline is created
 by the act of doing a recovery to a point in time earlier than the end
 of existing WAL.  When doing WAL archiving a history file can be copied
 off to the archive area by the existing archiver mechanism (ie, we'll
 make a .ready file for it as soon as it's written).
 

Need to check the archive code which relies on file shape and length

 * History files will be plain text (for human consumption) and will
 essentially consist of a list of parent timeline IDs in sequence.
 I envision adding the timeline split timestamp and starting WAL segment
 number too, but these are for documentation purposes --- the system
 doesn't need them.  We may as well allow comments in there as well,
 so that the DBA can annotate the reasons for a PITR split to have been
 done.  So the contents might look like
 
   # Recover from unintentional TRUNCATE
   0001000A001425682005-05-16 12:34:15 EDT
   # Ex-assistant DBA dropped wrong table
   0007002254342005-11-17 18:44:44 EST
 

Or should there be a recovery_comment parameter in the recovery.conf?
That would be better than suggesting that admins can edit such an
important file. (Even if they can, its best not to encourage it).

 When we split off a new timeline, we just have to copy the parent's
 history file (which we can do verbatim including comments) and then
 add a new line at the end showing the immediate parent's timeline ID
 and the other details of the split.  Initdb can create 0001.history
 with empty contents (since that timeline has no parents).

Yes.

Will you then delete the previous timeline's history file or just leave
it there? (OK, you say that later)

 * When we need to do recovery, we first identify the source timeline
 (either by reading the current timeline ID from pg_control, or the DBA
 can tell us with a parameter in recovery.conf).  We then read the
 history file for that timeline, and remember its sequence of parent
 timeline IDs.  We can crosscheck that pg_control's timeline ID is
 one of this set of timeline IDs, too --- if it's not then the wrong
 backup file was restored.

** Surely it is the backup itself that determines the source timeline? 

Backups are always taken in one particular timeline. The rollforward
must start at a checkpoint before the begin backup and roll past the end
of backup marker onwards. The starting checkpoint should be the last
checkpoint prior to backup - why would you pick another? That checkpoint
will always be in the current timeline, since we always come out of
startup with a checkpoint (either because we shutdown earlier, or we
recovered and just wrote another shutdown checkpoint). 

So the backup's timeline will determine the source timeline, but not
necessarily the target timeline.

...thinkingrecovery.conf would need to specify:
recovery_target (if there is one, either a time or txnid)
recovery_target_timeline (if there is one, otherwise end of last one)
recovery_target_history_file (which specifies how the timeline ids are
sequenced)

I take it that your understanding is that the recovery_target timeline
needs to be specified also?

 * During recovery, whenever we need to open a WAL segment file, we first
 try to open it with the source timeline ID; if that doesn't exist, try
 the immediate parent timeline ID; then the grandparent, etc.  Whenever
 we find a WAL file with a particular timeline ID, we forget about all
 parents further up in the history, and won't try to open their segments
 anymore (this is the generalization of my previous rule that you never
 drop down in timeline number as you scan forward).
 

This jigging around is OK, because most people will be using only 

Re: [HACKERS] Why we really need timelines *now* in PITR

2004-07-19 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 The crucial time is when re-running recoveries repeatedly and if we
 write the manual with sufficient red ink then we'll avoid this. But
 heck, not having your history file is only as bad as not having added
 timelines in the first place. Not great, just more care required.

Yeah, you only really need them when you are hip-deep in repeated
recovery retries.

If you haven't gotten to my later proposal yet, the history files will
be plain text and it'd be at least theoretically possible for someone to
reconstruct one by hand if needed.  All you need to have is the sequence
of parent timeline IDs, which you could reconstruct in most cases by
looking at the archived WAL files.

regards, tom lane

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [HACKERS] Why we really need timelines *now* in PITR

2004-07-19 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 On Mon, 2004-07-19 at 19:33, Tom Lane wrote:
 When doing WAL archiving a history file can be copied
 off to the archive area by the existing archiver mechanism (ie, we'll
 make a .ready file for it as soon as it's written).

 Need to check the archive code which relies on file shape and length

Yeah.  I made some adjustments already with the expectation that we'd
want to do this, but it'll take a little bit more weakening of the
code's tests.

 * When we need to do recovery, we first identify the source timeline
 (either by reading the current timeline ID from pg_control, or the DBA
 can tell us with a parameter in recovery.conf).

 ** Surely it is the backup itself that determines the source timeline? 

The backup determines the starting point, but there may be several
timelines you could follow after that (especially in the scenario where
you're redoing a recovery starting from the same backup).  The point
here is that there could be timeline branches after the backup
occurred.  So yes the backup has to be in an ancestral timeline, but not
necessarily exactly the recovery-target timeline.

 ...thinkingrecovery.conf would need to specify:
 recovery_target (if there is one, either a time or txnid)
 recovery_target_timeline (if there is one, otherwise end of last one)
 recovery_target_history_file (which specifies how the timeline ids are
 sequenced)

No, the source timeline is not necessarily associated with a
recovery_target --- for instance you might want it to run to the end of
a particular timeline.  I suspect it might be more confusing than
helpful to use the term target timeline.

 try to open it with the source timeline ID; if that doesn't exist, try
 the immediate parent timeline ID; then the grandparent, etc.

 This jigging around is OK, because most people will be using only one
 timeline anyhow, so its not likely to cause too much fuss for the user.

It might confuse someone who's watching the sequence of archive
retrieval requests, but as long as that's all mechanized it doesn't
seem like there's any real potential for trouble.

 ** I would prefer to add a random number to the timeline as a way of
 identifying the next one. This will produce fewer probes, so less wasted
 tape mounts,

How do you figure that?  Seems like the same number of probes either way.

 but most importantly it gets round this issue:

 You're on timeline X, then you recover and run for a while on timeline
 Y. You then realise recovering to that target was a really bad idea for
 some reason (some VIPs record wasn't in the recovered data etc). We then
 need to re-recover from the backup on X to a new timeline, Z. But how
 does X know that Y existed when it creates Z?

Because there is a Y.history file laying about (either in the archive or
pg_xlog).

We will need to recommend to DBAs that they not delete Y.history from
the archive unless they've already deleted all Y.whatever log segments.
Once they have done this, the past existence of timeline Y is no longer
of interest and so there'd be no real problem in recycling the ID.
I would say the above is just as true if you use random IDs as if you
use sequential ones.  I distrust systems that assume there will never be
a collision of randomly-chosen IDs.

 If Y = f(x) in a deterministic way, then Y will always == Z. Of course,
 we could provide an id, but what would you pick? The best way is to get
 out of trouble by picking a new timeline id that's very unlikely to have
 been picked before.

I do not see the advantage.

regards, tom lane

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] Why we really need timelines *now* in PITR

2004-07-19 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 I think we're heatedly agreeing again.

Yeah, I think so.  I'll get started on this tomorrow.

 where the default is notarget and if you specify a target, the default
 target_in_timeline is latest.

I think actually the default target has to be the timeline ID found in
pg_control --- otherwise you get weird behavior in the plain crash
recovery, non-PITR case.

 I don't like the name target_in_timeline,

Agreed, but I don't have a better name offhand for it.  The point I was
making is that we seem to be using target to mean a point-in-time
stopping target.  But you might be interested in going to the end of
timeline N and so there's not a target in that sense.  That's why I
was wanting to avoid using the term target for the desired timeline.
But maybe there's not a better word.

 ...we definitely need an offline-log inspection tool, don't we? 
 Next month...

Yeah.  When you get started, I have a toy tool I've been using for
awhile that might serve as a starting point.  (I'm going to have to
whack it around for timelines so there's little point in attaching
it right now...)

regards, tom lane

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [HACKERS] Why we really need timelines *now* in PITR

2004-07-19 Thread Simon Riggs
On Mon, 2004-07-19 at 23:15, Tom Lane wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
  On Mon, 2004-07-19 at 19:33, Tom Lane wrote:

  * When we need to do recovery, we first identify the source timeline
  (either by reading the current timeline ID from pg_control, or the DBA
  can tell us with a parameter in recovery.conf).
 
  ** Surely it is the backup itself that determines the source timeline? 
 
 The backup determines the starting point, but there may be several
 timelines you could follow after that (especially in the scenario where
 you're redoing a recovery starting from the same backup).  The point
 here is that there could be timeline branches after the backup
 occurred.  So yes the backup has to be in an ancestral timeline, but not
 necessarily exactly the recovery-target timeline.
 

Agreed.

  ...thinkingrecovery.conf would need to specify:
  recovery_target (if there is one, either a time or txnid)
  recovery_target_timeline (if there is one, otherwise end of last one)
  recovery_target_history_file (which specifies how the timeline ids are
  sequenced)
 
 No, the source timeline is not necessarily associated with a
 recovery_target --- for instance you might want it to run to the end of
 a particular timeline.  I suspect it might be more confusing than
 helpful to use the term target timeline.
 

I think we're heatedly agreeing again.

A summary: we don't specify the start timeline, but we do specify the
timeline which contains our chosen endpoint. [But when we reach it,
we may create a new timeline id if we didn't go to end of logs on that
timeline.] The history file specifies how to get from start to end,
through however many branchpoints there areand the history file we
use for recovery is the one pointed to by (target_in_timeline).

Or even shorter:
- backup specifies starting timeline (and so user specifies indirectly)
- user specifies end point (explicitly in recovery.conf)
- history file shows how to get from start to end

more thoughts...if you specify:
target = X
target_in_timeline

where the default is notarget and if you specify a target, the default
target_in_timeline is latest.

I don't like the name target_in_timeline, I'm just trying to clarify
what we mean so we can think of a better name for it.


...we definitely need an offline-log inspection tool, don't we? 
Next month...

 We will need to recommend to DBAs that they not delete Y.history from
 the archive unless they've already deleted all Y.whatever log segments.
 Once they have done this, the past existence of timeline Y is no longer
 of interest and so there'd be no real problem in recycling the ID.
 I would say the above is just as true if you use random IDs as if you
 use sequential ones.  I distrust systems that assume there will never be
 a collision of randomly-chosen IDs.
 

Yes, I argued myself in a circle, but it seemed worth recording just to
avoid repeating the thought next time.


Best regards, Simon Riggs


---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [HACKERS] Why we really need timelines *now* in PITR

2004-07-18 Thread Simon Riggs
On Sat, 2004-07-17 at 21:36, Tom Lane wrote:
 If we do not add timeline numbers to WAL file names, we will be forced
 to destroy information during recovery.  Consider the following
 scenario:
 
 1. You have a WAL directory containing, say, WAL segments 0010 to 0020
 (for the purposes of this example I won't bother typing out realistic
 16-digit filenames, but just use 4-digit names).
 
 2. You discover that your junior DBA messed up badly and you need to
 revert to yesterday evening's state.  Let's say the chosen recovery end
 time is in the middle of file 0014.
 
 3. You run the recovery process.  At its end, the WAL end pointer will
 be 0014 and some offset.
 
 If we simply run forward from this situation, then we will be
 overwriting existing WAL records in the existing files 0014-0020.
 This is bad from the point of view of not wanting to discard information
 (what if we decide we should have recovered to a later time??), but
 there is an even more serious reason for not doing that.  Suppose we
 suffer a crash sometime after recovery.  On restart, the system will
 start replaying the logs, and *there will be nothing to keep it from
 replaying all the way to the end of file 0020*.  (The files will contain
 proper, in-sequence page headers, so the tests that normally detect
 recycled log segments won't think there is anything wrong.)  This will
 leave you with a thoroughly corrupt database.
 
 One way to solve this would be to physically discard 0015-0020 as soon
 as we decide we're stopping short of the end of WAL.  I think that is
 unacceptable on don't-throw-away-information grounds.  I think it would
 be far better to invent the timeline concept.  Then, our old WAL files
 would be named say 0001.0010 through 0001.0020, and we would start
 logging into 0002.0014 after recovery.
 
 A slightly tricky point is that we have to sew together the end of one
 timeline and the start of the next --- for instance, we only want the
 front part of 0001.0014, not the back part, to be part of the new
 timeline.  Patrick Macdonald told me about a pretty baroque scheme that
 DB2 uses for this, but I think it would be simplest if we just copied
 the appropriate amount of data from 0001.0014 into 0002.0014 and then
 ran forward from there.  Copying a max of 16MB of data doesn't sound
 very onerous.
 

Well, yes - I completely agree that we need the timeline concept as one
of the highest priorities. I originally raised the problem timelines
solve because of the errors I had experienced re-running restores many
times with the same archive set. It's just too easy to overwrite log
files without the timeline concept.

IMHO you don't need to change the xlog format as a necessary step to
introduce timelines. Simply adding  to the logid is sufficient
(which lets face it takes a heck of long time before it gets to 1...)

[Also, as an extra detail on your analysis, when recovery is finished
you need to move both primary and secondary checkpoint markers forwards
to the new timeline, so that crash recovery can't go back to the old
timeline]

If you're going to change xlog filenames, then I would think that adding
the system identifier to the xlogs would be a very good addition. I
would simply have recommended keeping them in separate directories, but
putting it on the name would be best. PostgreSQL doesn't have a name
concept...which would be the thing to use if it did.

 During WAL replay or recovery, there would be a notion of the target
 timeline that you are trying to recover to a point within.  The rule
 for selecting which WAL segment file to read is use the one with
 largest timeline number less than or equal to the target, and never less
 than the timeline number you used for the previous segment.  So for
 example if we realized we'd chosen the wrong recovery target time, we
 could backpedal and redo the same recovery process with target timeline
 0001, ignoring any WAL segments that had been archived with timeline
 0002.  Alternatively, if we were simply doing crash recovery in timeline
 0002, we could stop at (say) segment 0002.0018, and we'd know that we
 should ignore 0001.0019 because it is not in our timeline.
 

That sounds like the way it should work.

The way you write this makes me think you might mean you would allow: we
can start recovering in one timelines, then rollforward takes us through
all the timeline nexus points required to get us to the target timeline.

I had imagined that recovery would only ever be allowed to start and end
on the same timeline. I think you probably mean that?


Another of the issues I was thinking through was what happens at the end
of your scenario abobe
- You're on timeline 1 and you need to perform recovery.
- You perform recovery and timeline 2 is created.
- You discover another error and decide to recover again.
- You recover timeline 1 again: what do you name the new timeline
created? 2 or 3? If you call it 2 you will be overwriting data just like
you would have done - 

Re: [HACKERS] Why we really need timelines *now* in PITR

2004-07-18 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 The way you write this makes me think you might mean you would allow: we
 can start recovering in one timelines, then rollforward takes us through
 all the timeline nexus points required to get us to the target
 timeline.

Sure.  Let's draw a diagram:

0001.0014 - 0001.0015 - 0001.0016 - 0001.0017 - ...
  |
  + 0002.0016 - 0002.0017 - ...
  |
  + 0003.0017 - ...

If you decide you would like to recover to someplace in timeline 0002,
you need to take the 0002 log files where they exist, and the 0001
log files where there is no 0002, except you do not revert to 0001
once you have used an 0002 file (this restriction is needed in case
the 0001 timeline goes to higher segment numbers than 0002 has reached).
In no case do you use an 0003 file.

 I had imagined that recovery would only ever be allowed to start and end
 on the same timeline. I think you probably mean that?

Logically it's all one timeline, I suppose, but to implement it
physically that way would mean duplicating all past 0001 segments when
we want to create the 0002 timeline.  That's not practical and not
necessary.

 Another of the issues I was thinking through was what happens at the end
 of your scenario abobe
 - You're on timeline 1 and you need to perform recovery.
 - You perform recovery and timeline 2 is created.
 - You discover another error and decide to recover again.
 - You recover timeline 1 again: what do you name the new timeline
 created? 2 or 3?

You really want to call it 3.  To enforce this mechanically would
require having a counter that sits outside the $PGDATA directory and
is not subject to being reverted by a restore-from-backup.  I don't
see any very clean way to do that at the moment --- any thoughts?

In the absence of such a counter we could ask the DBA to specify a new
timeline number in recovery.conf, but this strikes me as one of those
easy-to-get-wrong things ...

One possibility is to extend the archiving API so that we can inquire
about the largest timeline number that exists anywhere in the archive.
If we take new timeline number = 1 + max(any in archive, any in pg_xlog)
then we are safe.  But I'm not really convinced that such a thing would
be any less error-prone than the manual way :-(, because for any
archival method that's more complicated than cp them all into one
directory, it'd be hard to extract the max stored filename.

regards, tom lane

---(end of broadcast)---
TIP 8: explain analyze is your friend


[HACKERS] Why we really need timelines *now* in PITR

2004-07-17 Thread Tom Lane
If we do not add timeline numbers to WAL file names, we will be forced
to destroy information during recovery.  Consider the following
scenario:

1. You have a WAL directory containing, say, WAL segments 0010 to 0020
(for the purposes of this example I won't bother typing out realistic
16-digit filenames, but just use 4-digit names).

2. You discover that your junior DBA messed up badly and you need to
revert to yesterday evening's state.  Let's say the chosen recovery end
time is in the middle of file 0014.

3. You run the recovery process.  At its end, the WAL end pointer will
be 0014 and some offset.

If we simply run forward from this situation, then we will be
overwriting existing WAL records in the existing files 0014-0020.
This is bad from the point of view of not wanting to discard information
(what if we decide we should have recovered to a later time??), but
there is an even more serious reason for not doing that.  Suppose we
suffer a crash sometime after recovery.  On restart, the system will
start replaying the logs, and *there will be nothing to keep it from
replaying all the way to the end of file 0020*.  (The files will contain
proper, in-sequence page headers, so the tests that normally detect
recycled log segments won't think there is anything wrong.)  This will
leave you with a thoroughly corrupt database.

One way to solve this would be to physically discard 0015-0020 as soon
as we decide we're stopping short of the end of WAL.  I think that is
unacceptable on don't-throw-away-information grounds.  I think it would
be far better to invent the timeline concept.  Then, our old WAL files
would be named say 0001.0010 through 0001.0020, and we would start
logging into 0002.0014 after recovery.

A slightly tricky point is that we have to sew together the end of one
timeline and the start of the next --- for instance, we only want the
front part of 0001.0014, not the back part, to be part of the new
timeline.  Patrick Macdonald told me about a pretty baroque scheme that
DB2 uses for this, but I think it would be simplest if we just copied
the appropriate amount of data from 0001.0014 into 0002.0014 and then
ran forward from there.  Copying a max of 16MB of data doesn't sound
very onerous.

During WAL replay or recovery, there would be a notion of the target
timeline that you are trying to recover to a point within.  The rule
for selecting which WAL segment file to read is use the one with
largest timeline number less than or equal to the target, and never less
than the timeline number you used for the previous segment.  So for
example if we realized we'd chosen the wrong recovery target time, we
could backpedal and redo the same recovery process with target timeline
0001, ignoring any WAL segments that had been archived with timeline
0002.  Alternatively, if we were simply doing crash recovery in timeline
0002, we could stop at (say) segment 0002.0018, and we'd know that we
should ignore 0001.0019 because it is not in our timeline.

regards, tom lane

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])