On Wed, Mar 31, 2010 at 1:28 AM, Heikki Linnakangas
wrote:
> Fujii Masao wrote:
>>> * Small code changes to handling of failedSources, inspired by your
>>> comment. No change in functionality.
>>>
>>> This is also available in my git repository at
>>> git://git.postgresql.org/git/users/heikki/post
Fujii Masao wrote:
>> * Small code changes to handling of failedSources, inspired by your
>> comment. No change in functionality.
>>
>> This is also available in my git repository at
>> git://git.postgresql.org/git/users/heikki/postgres.git, branch "xlogchanges"
>
> I looked at the patch and was not
On Thu, Mar 25, 2010 at 9:55 PM, Heikki Linnakangas
wrote:
> * Fix the bug of a spurious PANIC in archive recovery, if the WAL ends
> in the middle of a WAL record that continues over a WAL segment boundary.
>
> * If a corrupt WAL record is found in archive or streamed from master in
> standby mod
On Thu, 2010-03-25 at 12:26 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > On Thu, 2010-03-25 at 10:11 +0200, Heikki Linnakangas wrote:
> >
> >> PANIC seems like the appropriate solution for now.
> >
> > It definitely is not. Think some more.
>
> Well, what happens now in previous ver
On Thu, 2010-03-25 at 12:15 +0200, Heikki Linnakangas wrote:
> (cc'ing docs list)
>
> Simon Riggs wrote:
> > The lack of docs begins to show a lack of coherent high-level design
> > here.
>
> Yeah, I think you're right. It's becoming hard to keep track of how it's
> supposed to behave.
Thank you
On Thu, Mar 25, 2010 at 8:55 AM, Heikki Linnakangas
wrote:
> * If a corrupt WAL record is found in archive or streamed from master in
> standby mode, throw WARNING instead of PANIC, and keep trying. In
> archive recovery (ie. standby_mode=off) it's still a PANIC. We can make
> it a WARNING too, wh
Fujii Masao wrote:
> On second thought, the following lines seem to be necessary just after
> calling XLogPageRead() since it reads new WAL file from another source.
>
>> if (readSource == XLOG_FROM_STREAM || readSource == XLOG_FROM_ARCHIVE)
>> emode = PANIC;
>> else
>>
Fujii Masao wrote:
>> sources &= ~failedSources;
>> failedSources |= readSource;
>
> The above lines in XLogPageRead() seem not to be required in normal
> recovery case (i.e., standby_mode = off). So how about the attached
> patch?
>
> *** 9050,9056 next_record_is_invalid:
> --- 9047,9056 ---
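
The two quoted lines, `sources &= ~failedSources;` and `failedSources |= readSource;`, treat the WAL sources as a bitmask. Below is a minimal sketch of that idea, not the code from the patch. The names XLOG_FROM_ARCHIVE and XLOG_FROM_STREAM come from the thread; XLOG_FROM_PG_XLOG, the bit values, and the helper next_source() are assumptions made for illustration only.

```c
/* Hypothetical bit values. Only the names XLOG_FROM_ARCHIVE and
 * XLOG_FROM_STREAM appear in the thread; XLOG_FROM_PG_XLOG is assumed. */
enum
{
    XLOG_FROM_ARCHIVE = 1 << 0,
    XLOG_FROM_PG_XLOG = 1 << 1,
    XLOG_FROM_STREAM  = 1 << 2
};

/*
 * Hypothetical helper: return the first source in 'sources' that is not
 * marked in 'failedSources', or 0 when every candidate has failed.
 */
int
next_source(int sources, int failedSources)
{
    int bit;

    /* Drop every source that has already failed. */
    sources &= ~failedSources;

    for (bit = XLOG_FROM_ARCHIVE; bit <= XLOG_FROM_STREAM; bit <<= 1)
    {
        if (sources & bit)
            return bit;
    }
    return 0;
}
```

A caller would mark the source it just read from with `failedSources |= readSource;` after a bad record and then ask next_source() again; a return value of 0 means there is nothing left to try in this round.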
Heikki Linnakangas wrote:
> Simon Riggs wrote:
>> On Thu, 2010-03-25 at 10:11 +0200, Heikki Linnakangas wrote:
>>
>>> PANIC seems like the appropriate solution for now.
>> It definitely is not. Think some more.
>
> Well, what happens now in previous versions with pg_standby et al is
> that the sta
Simon Riggs wrote:
> On Thu, 2010-03-25 at 10:11 +0200, Heikki Linnakangas wrote:
>
>> PANIC seems like the appropriate solution for now.
>
> It definitely is not. Think some more.
Well, what happens now in previous versions with pg_standby et al is
that the standby starts up. That doesn't seem
(cc'ing docs list)
Simon Riggs wrote:
> The lack of docs begins to show a lack of coherent high-level design
> here.
Yeah, I think you're right. It's becoming hard to keep track of how it's
supposed to behave.
> By now, I've forgotten what this thread was even about. The major
> design decision
Simon Riggs wrote:
> On Thu, 2010-03-25 at 11:08 +0900, Fujii Masao wrote:
>> And if the trigger file is
>> found, I think that the startup process should emit a FATAL, i.e., the
>> server should exit immediately, to prevent the server from becoming the
>> primary in a half-finished state.
>
> Pl
On Thu, 2010-03-25 at 10:11 +0200, Heikki Linnakangas wrote:
> PANIC seems like the appropriate solution for now.
It definitely is not. Think some more.
--
Simon Riggs www.2ndQuadrant.com
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your
On Thu, 2010-03-25 at 11:08 +0900, Fujii Masao wrote:
> On Thu, Mar 25, 2010 at 8:23 AM, Simon Riggs wrote:
> > PANICing won't change the situation, so it just destroys server
> > availability. If we had 1 master and 42 slaves then this behaviour would
> > take down almost the whole server farm at
Tom Lane wrote:
> Fujii Masao writes:
>> OK. How about making the startup process emit WARNING, stop WAL replay and
>> wait for the presence of trigger file, when an invalid record is found?
>> Which keeps the server up for readonly queries. And if the trigger file is
>> found, I think that the st
On Thu, 2010-03-25 at 11:08 +0900, Fujii Masao wrote:
> On Thu, Mar 25, 2010 at 8:23 AM, Simon Riggs wrote:
> > PANICing won't change the situation, so it just destroys server
> > availability. If we had 1 master and 42 slaves then this behaviour would
> > take down almost the whole server farm at
Fujii Masao writes:
> OK. How about making the startup process emit WARNING, stop WAL replay and
> wait for the presence of trigger file, when an invalid record is found?
> Which keeps the server up for readonly queries. And if the trigger file is
> found, I think that the startup process should e
On Thu, Mar 25, 2010 at 8:23 AM, Simon Riggs wrote:
> PANICing won't change the situation, so it just destroys server
> availability. If we had 1 master and 42 slaves then this behaviour would
> take down almost the whole server farm at once. Very uncool.
>
> You might have reason to prevent the s
On Wed, 2010-03-24 at 14:31 +0200, Heikki Linnakangas wrote:
> Fujii Masao wrote:
> > But in the current (v8.4 or before) behavior, recovery ends normally
> > when an invalid record is found in an archived WAL file. Otherwise,
> > the server would never be able to start normal processing when there
On Wed, Mar 24, 2010 at 10:20 PM, Fujii Masao wrote:
>> Thanks. That's easily fixable (applies over the previous patch):
>>
>> --- a/src/backend/access/transam/xlog.c
>> +++ b/src/backend/access/transam/xlog.c
>> @@ -3773,7 +3773,7 @@ retry:
>> pagelsn.xrecoff = 0;
>> }
>
On Wed, Mar 24, 2010 at 9:31 PM, Heikki Linnakangas
wrote:
> Hmm, true, this changes behavior over previous releases. I tend to think
> that it's always an error if there's a corrupt file in the archive,
> though, and PANIC is appropriate. If the administrator wants to start up
> the database anyw
Fujii Masao wrote:
> But in the current (v8.4 or before) behavior, recovery ends normally
> when an invalid record is found in an archived WAL file. Otherwise,
> the server would never be able to start normal processing when there
> is a corrupted archived file for some reasons. So, that invalid re
Sorry for the delay.
On Fri, Mar 19, 2010 at 8:37 PM, Heikki Linnakangas
wrote:
> Here's a patch I've been playing with.
Thanks! I'm reading the patch.
> The idea is that in standby mode,
> the server keeps trying to make progress in the recovery by:
>
> a) restoring files from archive
> b) rep
Alvaro Herrera wrote:
> Heikki Linnakangas wrote:
>
>> When recovery reaches an invalid WAL record, typically caused by a
>> half-written WAL file, it closes the file and moves to the next source.
>> If an error is found in a file restored from archive or in a portion
>> just streamed from mast
Heikki Linnakangas wrote:
> When recovery reaches an invalid WAL record, typically caused by a
> half-written WAL file, it closes the file and moves to the next source.
> If an error is found in a file restored from archive or in a portion
> just streamed from master, however, a PANIC is thrown
Tom Lane wrote:
> Heikki Linnakangas writes:
>> Simon Riggs wrote:
>>> We might also have written half a file many times. The files in pg_xlog
>>> are suspect whereas the files in the archive are not. If we have both we
>>> should prefer the archive.
>
>> Yep.
>
> Really? That will result in a
Heikki Linnakangas writes:
> Simon Riggs wrote:
>> We might also have written half a file many times. The files in pg_xlog
>> are suspect whereas the files in the archive are not. If we have both we
>> should prefer the archive.
> Yep.
Really? That will result in a change in the longstanding be
Simon Riggs wrote:
> On Thu, 2010-03-18 at 23:27 +0900, Fujii Masao wrote:
>
>> I agree that this is a bigger problem. Since the standby always starts
>> walreceiver before replaying any WAL files in pg_xlog, walreceiver tries
>> to receive the WAL files following the REDO starting point even if t
On Thu, 2010-03-18 at 23:27 +0900, Fujii Masao wrote:
> I agree that this is a bigger problem. Since the standby always starts
> walreceiver before replaying any WAL files in pg_xlog, walreceiver tries
> to receive the WAL files following the REDO starting point even if they
> have already been in
On Wed, Mar 17, 2010 at 7:35 PM, Heikki Linnakangas
wrote:
> Fujii Masao wrote:
>> I found another missing feature in new file-based log shipping (i.e.,
>> standby_mode is enabled and 'cp' is used as restore_command).
>>
>> After the trigger file is found, the startup process with pg_standby
>> tr
On Wed, 2010-03-17 at 12:35 +0200, Heikki Linnakangas wrote:
> Looking into this, I realized that we have a bigger problem...
A lot of this would be easier if you do the docs first, then work
through the problems. The new system is more complex, since it has two
modes rather than one and also mul
Fujii Masao wrote:
> I found another missing feature in new file-based log shipping (i.e.,
> standby_mode is enabled and 'cp' is used as restore_command).
>
> After the trigger file is found, the startup process with pg_standby
> tries to replay all of the WAL files in both pg_xlog and the archive
On Fri, Feb 12, 2010 at 2:29 AM, Heikki Linnakangas
wrote:
> So the only major feature we're missing is the ability to clean up old
> files.
I found another missing feature in new file-based log shipping (i.e.,
standby_mode is enabled and 'cp' is used as restore_command).
After the trigger file
On Sat, Feb 13, 2010 at 1:10 AM, Heikki Linnakangas
wrote:
> Are you thinking of a scenario where remove_command gets stuck, and
> prevents bgwriter from performing restartpoints while it's stuck?
Yes. If there is the archive in the remote server and the network outage
happens, remove_command mig
Simon Riggs writes:
> Attached patch implements pg_standby for use as an
> archive_cleanup_command, reusing existing code with new -a option.
>
> Happy to add the archive_cleanup_command into main server as well, if
> you like. Won't take long.
Would it be possible to have the server do the clean
Fujii Masao wrote:
> On Fri, Feb 12, 2010 at 10:10 PM, Heikki Linnakangas
> wrote:
>>> So I suggest that you have a new action that gets called after every
>>> checkpoint to clear down the archive. It will remove all files from the
>>> archive prior to %r. We can implement that as a sequence of un
On Fri, Feb 12, 2010 at 10:10 PM, Heikki Linnakangas
wrote:
>> So I suggest that you have a new action that gets called after every
>> checkpoint to clear down the archive. It will remove all files from the
>> archive prior to %r. We can implement that as a sequence of unlink()s
>> from within the
On Fri, 2010-02-12 at 12:54 +, Simon Riggs wrote:
> So I suggest that you have a new action that gets called after every
> checkpoint to clear down the archive. It will remove all files from the
> archive prior to %r. We can implement that as a sequence of unlink()s
> from within the server, o
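
The "remove all files from the archive prior to %r" step quoted above comes down to a filename comparison. The following is only a sketch of that comparison, not code from the thread. It assumes that WAL segment files are named with 24 uppercase hex digits, that the leading 8 digits (the timeline) can be skipped when ordering, and that %r has already been expanded to a segment file name. The helper names are made up for illustration.

```c
#include <string.h>

#define WAL_NAME_LEN 24         /* assumed: 24 uppercase hex digits */

/* Hypothetical helper: does 'name' look like a plain WAL segment file? */
int
is_wal_segment_name(const char *name)
{
    return strlen(name) == WAL_NAME_LEN &&
        strspn(name, "0123456789ABCDEF") == WAL_NAME_LEN;
}

/*
 * Hypothetical helper: return 1 if 'name' sorts strictly before
 * 'restart_file' (the expanded %r), ignoring the timeline prefix.
 * Anything that is not a plain segment name is never removable.
 */
int
removable_before(const char *name, const char *restart_file)
{
    if (!is_wal_segment_name(name) || !is_wal_segment_name(restart_file))
        return 0;
    return strcmp(name + 8, restart_file + 8) < 0;
}
```

A cleanup pass would walk the archive directory and unlink() each entry for which removable_before() returns 1. Backup history files and other non-segment names are left alone by this sketch.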
Simon Riggs wrote:
> In 8.4 it is pg_standby that was responsible for clearing down the
> archive, which is why I suggested using pg_standby for that again. I
> agree that will not work. The important thing is not pg_standby but that
> we have a valid mechanism for clearing down the archive.
Good
On Fri, 2010-02-12 at 14:38 +0900, Fujii Masao wrote:
> On Thu, Feb 11, 2010 at 11:22 PM, Heikki Linnakangas
> wrote:
> > Simon Riggs wrote:
> >> Might it not be simpler to add a parameter onto pg_standby?
> >> We send %s to tell pg_standby the standby_mode of the server which is
> >> calling it s
Simon Riggs wrote:
> On Thu, 2010-02-11 at 13:08 -0500, Tom Lane wrote:
>> Heikki Linnakangas writes:
>>> -1. it isn't necessary for PITR. It's a new requirement for
>>> standby_mode='on', unless we add the file size check into the backend. I
>>> think we should add the file size check to the back
On Thu, Feb 11, 2010 at 11:22 PM, Heikki Linnakangas
wrote:
> Simon Riggs wrote:
>> Might it not be simpler to add a parameter onto pg_standby?
>> We send %s to tell pg_standby the standby_mode of the server which is
>> calling it so it can decide how to act in each case.
>
> That would work too,
On Thu, 2010-02-11 at 19:29 +0200, Heikki Linnakangas wrote:
> Aidan Van Dyk wrote:
> > * Heikki Linnakangas [100211 09:17]:
> >
> >> Yeah, if you're careful about that, then this change isn't required. But
> >> pg_standby protects against that, so I think it'd be reasonable to have
> >> the same
On Thu, Feb 11, 2010 at 01:22:44PM -0500, Kevin Grittner wrote:
> Heikki Linnakangas wrote:
>
> > I think 'rsync' has the same problem.
>
> There is a switch you can use to create the problem under rsync, but
> by default rsync copies to a temporary file name and moves the
> completed file to
Heikki Linnakangas wrote:
> I think 'rsync' has the same problem.
There is a switch you can use to create the problem under rsync, but
by default rsync copies to a temporary file name and moves the
completed file to the target name.
-Kevin
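
For reference, the copy-to-a-temporary-name-then-move pattern described above can be sketched in C as below. This illustrates the general technique only; it is not how rsync is implemented. The function name and the three-path interface are assumptions, and a real archiver would also want to fsync before renaming.

```c
#include <stdio.h>

/*
 * Hypothetical helper: copy 'src' to 'tmp', then rename 'tmp' to 'dst'.
 * A reader that only ever opens 'dst' sees either no file or the whole
 * file, assuming 'tmp' and 'dst' are on the same filesystem.
 * Returns 0 on success, -1 on failure.
 */
int
copy_then_rename(const char *src, const char *tmp, const char *dst)
{
    FILE   *in;
    FILE   *out;
    char    buf[8192];
    size_t  n;
    int     ok = 1;

    in = fopen(src, "rb");
    if (in == NULL)
        return -1;
    out = fopen(tmp, "wb");
    if (out == NULL)
    {
        fclose(in);
        return -1;
    }

    while ((n = fread(buf, 1, sizeof(buf), in)) > 0)
    {
        if (fwrite(buf, 1, n, out) != n)
        {
            ok = 0;
            break;
        }
    }
    if (ferror(in))
        ok = 0;
    fclose(in);
    if (fclose(out) != 0)
        ok = 0;

    /* Only publish the file under its final name once it is complete. */
    if (!ok || rename(tmp, dst) != 0)
    {
        remove(tmp);
        return -1;
    }
    return 0;
}
```

With a plain 'cp' straight to the target name there is no such publish step, which is why a restore_command can pick up a half-copied file.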
On Thu, 2010-02-11 at 13:08 -0500, Tom Lane wrote:
> Heikki Linnakangas writes:
> > -1. it isn't necessary for PITR. It's a new requirement for
> > standby_mode='on', unless we add the file size check into the backend. I
> > think we should add the file size check to the backend instead and save
>
Aidan Van Dyk wrote:
> * Heikki Linnakangas [100211 12:04]:
>
>>> But it can be a problem - without the last WAL (or at least enough of
>>> it) the master switched and archived, you have no guarantee of having
>>> being consistent again (I'm thinking specifically of recovering from a
>>> fresh ba
Heikki Linnakangas writes:
> -1. it isn't necessary for PITR. It's a new requirement for
> standby_mode='on', unless we add the file size check into the backend. I
> think we should add the file size check to the backend instead and save
> admins the headache.
I think the file size check needs to
* Heikki Linnakangas [100211 12:04]:
> > But it can be a problem - without the last WAL (or at least enough of
> > it) the master switched and archived, you have no guarantee of having
> > being consistent again (I'm thinking specifically of recovering from a
> > fresh backup)
>
> You have to wa
Aidan Van Dyk wrote:
> * Heikki Linnakangas [100211 09:17]:
>
>> Yeah, if you're careful about that, then this change isn't required. But
>> pg_standby protects against that, so I think it'd be reasonable to have
>> the same level of protection built-in. It's not a lot of code.
>
> This 1 check
Aidan Van Dyk wrote:
> * Heikki Linnakangas [100211 09:17]:
>
>> If the file is just being copied to the archive when restore_command
>> ('cp', say) is launched, it will copy a half file. That's not a problem
>> for PITR, because PITR will end at the end of valid WAL anyway, but
>> returning a ha
Simon Riggs wrote:
> It would mean that pg_standby would act appropriately according to the
> setting of standby_mode. So you wouldn't need multiple examples of use,
> it would all just work whatever the setting of standby_mode. Nice simple
> entry in the docs.
>
+1. I like the %s idea. IMHO fi
Simon Riggs wrote:
> On Thu, 2010-02-11 at 16:22 +0200, Heikki Linnakangas wrote:
>> Simon Riggs wrote:
>>> Might it not be simpler to add a parameter onto pg_standby?
>>> We send %s to tell pg_standby the standby_mode of the server which is
>>> calling it so it can decide how to act in each case.
Heikki Linnakangas wrote:
> Simon Riggs wrote:
>> Might it not be simpler to add a parameter onto pg_standby?
>> We send %s to tell pg_standby the standby_mode of the server which is
>> calling it so it can decide how to act in each case.
> That would work too, but it doesn't seem any simpler to me
* Heikki Linnakangas [100211 09:17]:
> If the file is just being copied to the archive when restore_command
> ('cp', say) is launched, it will copy a half file. That's not a problem
> for PITR, because PITR will end at the end of valid WAL anyway, but
> returning a half WAL file in standby mode i
On Thu, 2010-02-11 at 16:22 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > Might it not be simpler to add a parameter onto pg_standby?
> > We send %s to tell pg_standby the standby_mode of the server which is
> > calling it so it can decide how to act in each case.
>
> That would work t
Simon Riggs wrote:
> Might it not be simpler to add a parameter onto pg_standby?
> We send %s to tell pg_standby the standby_mode of the server which is
> calling it so it can decide how to act in each case.
That would work too, but it doesn't seem any simpler to me. On the contrary.
--
Heikki
Aidan Van Dyk wrote:
> But colour me confused, I'm still not understanding why this is any
> different than with normal PITR recovery.
>
> So even with a plain "cp" in your recovery command instead of a
> sleep+copy (a la pg_standby, or PITR tools, or all the home-grown
> solutions out there), I'm
On Thu, 2010-02-11 at 15:55 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > One question then: how do we ensure that the archive does not grow too
> > big? pg_standby cleans down the archive using %R. That function appears
> > to not exist anymore.
>
> You can still use %R. Of course, p
* Heikki Linnakangas [100211 08:29]:
> To support a restore_command that does the sleeping itself, like
> pg_standby, would require a major rearchitecting of the retry logic. And
> I don't see why that'd be desirable anyway. It's easier for the admin to
> set up using simple commands like 'cp' or
Simon Riggs wrote:
> One question then: how do we ensure that the archive does not grow too
> big? pg_standby cleans down the archive using %R. That function appears
> to not exist anymore.
You can still use %R. Of course, plain 'cp' won't know what to do with
it, so a script will then be require
On Thu, 2010-02-11 at 14:41 +0100, Dimitri Fontaine wrote:
> Simon Riggs writes:
> > If you were running pg_standby as the restore_command then this error
> > wouldn't happen. So you need to explain why running pg_standby cannot
> > solve your problem and why we must fix it by replicating code tha
On Thu, 2010-02-11 at 15:28 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > If you were running pg_standby as the restore_command then this error
> > wouldn't happen. So you need to explain why running pg_standby cannot
> > solve your problem and why we must fix it by replicating code tha
Simon Riggs writes:
> If you were running pg_standby as the restore_command then this error
> wouldn't happen. So you need to explain why running pg_standby cannot
> solve your problem and why we must fix it by replicating code that has
> previously existed elsewhere.
Let me try.
pg_standby will
Simon Riggs wrote:
> If you were running pg_standby as the restore_command then this error
> wouldn't happen. So you need to explain why running pg_standby cannot
> solve your problem and why we must fix it by replicating code that has
> previously existed elsewhere.
pg_standby cannot be used with
On Thu, 2010-02-11 at 14:44 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > On Thu, 2010-02-11 at 14:22 +0200, Heikki Linnakangas wrote:
> >> Simon Riggs wrote:
> >>> On Wed, 2010-02-10 at 09:32 +0200, Heikki Linnakangas wrote:
> >>>> Hmm, so after running restore_command, check the file
Simon Riggs wrote:
> On Thu, 2010-02-11 at 14:22 +0200, Heikki Linnakangas wrote:
>> Simon Riggs wrote:
>>> On Wed, 2010-02-10 at 09:32 +0200, Heikki Linnakangas wrote:
>>>> Hmm, so after running restore_command, check the file size and if it's
>>>> too short, treat it the same as if restore_comman
On Thu, 2010-02-11 at 14:22 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > On Wed, 2010-02-10 at 09:32 +0200, Heikki Linnakangas wrote:
> >> Hmm, so after running restore_command, check the file size and if it's
> >> too short, treat it the same as if restore_command returned non-zero?
>
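
The check proposed in the quoted text could look roughly like the sketch below. This is an illustration under stated assumptions, not the code that was committed. The function name is invented, WAL_SEGMENT_SIZE is assumed to be the default 16 MB, and "too short" is read as "smaller than one full segment".

```c
#include <sys/types.h>
#include <sys/stat.h>

#define WAL_SEGMENT_SIZE (16 * 1024 * 1024)     /* assumed default: 16 MB */

/*
 * Hypothetical helper: decide whether a file produced by restore_command
 * can be used. Returns 1 if usable, 0 if the attempt should be treated
 * like a failed restore_command and retried on the next iteration.
 */
int
restored_file_usable(int restore_rc, const char *path, off_t expected_size)
{
    struct stat st;

    if (restore_rc != 0)
        return 0;               /* restore_command itself failed */
    if (stat(path, &st) != 0)
        return 0;               /* nothing was restored */
    if (st.st_size < expected_size)
        return 0;               /* partial copy: same as a failure */
    return 1;
}
```

The point of folding the size check into the same return path is that the caller needs no new state: a short file simply looks like "not there yet".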
Simon Riggs wrote:
> On Wed, 2010-02-10 at 09:32 +0200, Heikki Linnakangas wrote:
>> Hmm, so after running restore_command, check the file size and if it's
>> too short, treat it the same as if restore_command returned non-zero?
>> And it will be retried on the next iteration. Works for me, though
On Wed, 2010-02-10 at 09:32 +0200, Heikki Linnakangas wrote:
> Fujii Masao wrote:
> > As I pointed out previously, the standby might restore a partially-filled
> > WAL file that is being archived by the primary, and cause a FATAL error.
> > And this happened in my box when I was testing the SR.
> >
Aidan Van Dyk wrote:
> * Heikki Linnakangas [100210 02:33]:
>
>> Hmm, so after running restore_command, check the file size and if it's
>> too short, treat it the same as if restore_command returned non-zero?
>> And it will be retried on the next iteration. Works for me, though OTOH
>> it will t
* Heikki Linnakangas [100210 02:33]:
> Hmm, so after running restore_command, check the file size and if it's
> too short, treat it the same as if restore_command returned non-zero?
> And it will be retried on the next iteration. Works for me, though OTOH
> it will then fail to complain about a