On Sat, Feb 11, 2017 at 10:38 AM, Michael Banck <michael.ba...@credativ.de>
wrote:

> Hi,
>
> one take-away from the Gitlab Post-Mortem[1] appears to be that after
> their secondary lost replication, they were confused about what
> pg_basebackup was doing when they tried to rebuild it. It just sat there
> and did nothing (even with --verbose), so they assumed something was
> wrong with either the primary or the connection, and restarted it
> several times.
>
> AFAICT, it turns out the checkpoint was written on the master (they
> probably did not use -c fast), but this wasn't obvious to them:
>


Yeah, I've seen this happen to a number of people, and it sounds like
that's what happened here as well. I've considered something along the lines
of the patch you posted, but never got around to actually doing anything
about it.



> ISTM that even with WAL streaming, nothing would be written on the
> client server until the checkpoint is complete, as do_pg_start_backup()
> runs the checkpoint and only returns the starting WAL location
> afterwards.
>
> The attached (untested) patch is to kick off a discussion on how to
> improve the situation. It is supposed to mention the checkpoint when
> --verbose is used, and it adds a paragraph about the checkpoint being run
> to the Notes section of the documentation.
>
>
Docs look good to me, other than claiming that pg_basebackup runs on a
server (it can run anywhere). I would just say "during which pg_basebackup
will appear idle". How does that sound to you?

As for the code, while I haven't tested it, isn't the "checkpoint
completed" message in the wrong place? Doesn't PQsendQuery() return
immediately, so that the message needs to be emitted *after* the
PQgetResult() call?
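
To illustrate the ordering concern, here is a minimal sketch. It does not
use libpq or talk to a server: fake_send_query() and fake_get_result() are
hypothetical stand-ins for PQsendQuery() and PQgetResult(), and
run_backup_start() is an illustrative name that is not part of the patch.
The sketch assumes that the first result only arrives once the server has
finished the checkpoint, as described above for do_pg_start_backup().

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Records the order of events as a comma-separated string. */
static char event_log[256];

static void log_event(const char *name)
{
    /* Make sure the separator, the name and the terminator still fit. */
    assert(strlen(event_log) + strlen(name) + 2 <= sizeof(event_log));
    if (event_log[0] != '\0')
        strcat(event_log, ",");
    strcat(event_log, name);
}

/* Hypothetical stand-in for PQsendQuery(): dispatches the command and
 * returns at once, without waiting for the server to do any work. */
static int fake_send_query(void)
{
    log_event("sent");
    return 1;
}

/* Hypothetical stand-in for PQgetResult(): returns only after the server
 * has replied, which is assumed here to be after the checkpoint. */
static void fake_get_result(void)
{
    log_event("checkpoint");
    log_event("result");
}

/* Runs the start-of-backup sequence and returns the resulting event order.
 * report_after_result selects where the "checkpoint completed" message
 * is emitted relative to the result. */
static const char *run_backup_start(int report_after_result)
{
    event_log[0] = '\0';
    fake_send_query();
    if (!report_after_result)
        log_event("message");   /* too early: checkpoint has not run yet */
    fake_get_result();
    if (report_after_result)
        log_event("message");   /* correct: checkpoint is known to be done */
    return event_log;
}
```

Calling run_backup_start(0) models emitting the message right after the
send, in which case it shows up before the checkpoint has even started;
run_backup_start(1) models emitting it after the result has been received.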

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/
